Method of identifying prokaryotic gene structure

ABSTRACT

Disclosed are: a method of determining a genetic structure which includes a process of first predicting the coding regions which form a transcription unit and then determining a translation start codon; a method of determining a genetic structure which includes a process of selecting, within the nucleotide sequences of a plurality of coding regions which have already been determined, a plurality of pairs of codons for which the difference between the appearance frequency of a codon and that of the codon which has the reverse complementary sequence is great, and of deciding that those coding regions in which the high-frequency codon of each pair appears a large number of times are true coding regions; a method of determining a genetic structure for a nucleotide sequence whose GC content exceeds 50%, which includes a process of deciding that a coding region is false if the GC content at the first and third positions of its codons is less than a predetermined value; a program for performing these methods; a computer-readable recording medium upon which this program has been recorded; and a genetic structure determination system which is based upon a computer which executes this program.

TECHNICAL FIELD

The present invention relates to a method of determining a genetic structure based upon a nucleotide sequence of a prokaryote (nucleotide sequence means a sequence of DNA or of RNA), to a program for executing the method, to a computer-readable recording medium on which the program is recorded, and to a system for determining a genetic structure based upon a computer holding the recording medium.

BACKGROUND ART

Various microbes, and the various enzymes which they produce, are utilized in wide fields of industry, and there is a great demand to improve these microbes and to discover new enzymes. The decoding of the nucleotide sequence information of living organisms has accelerated due to the progress of automatic fluorescence sequencers in the latter part of the 1980s, and the study of genomes has advanced greatly. Since the entire genome sequence of a bacterium (Haemophilus influenzae) was determined for the first time in 1995, the entire genome sequences of about 50 varieties of microbes have been determined at the present time. Furthermore, there are about 200 varieties of microbes whose genome sequences are being determined at the present time, so that the genome sequences of more than 250 varieties of microbes will soon be clarified. The field of microbial biotechnology is entering the post-genome era. To promote studies which utilize this massive amount of genetic information, there is a great demand in research and development for technology which analyzes colossal amounts of microbial genome information at high accuracy and at high speed.

A first step in the decoding of genome information is to determine a genetic structure from the determined nucleotide sequence information. If it is possible to determine the genetic structure, in particular the positions of the coding regions (coding sequence: CDS) (hereinafter a coding region of a gene is referred to as a coding region), then it becomes possible to predict the functions of the gene products, since it is possible to predict their amino acid sequences. Furthermore, if it is possible to predict the structure of a transcription unit such as a polycistron, then it is possible to predict an expression control mechanism for a group of genes which are upon the same transcription unit. Methods of predicting the genetic structure from the nucleotide sequence information are basic and important techniques, and various methods have been developed up till now. When predicting a genetic structure from nucleotide sequence information using a computer, if the total number of “correct” structures, in other words “true” structures, is termed N_(T), the total number of structures predicted by the computer is termed N_(S), and the number of structures which are predicted by the computer and are identical with “correct” structures, in other words “true” structures, is termed N_(TP), then N_(TP)/N_(T) is termed the sensitivity, and N_(TP)/N_(S) is termed the specificity. The closer both the sensitivity and the specificity are to the numerical value “1”, the better the performance of the program is considered to be. Accordingly, there is a great demand for the development of a program which exhibits excellent performance in both sensitivity and specificity.

The characteristics of the coding regions of a prokaryote have been extracted by using a stochastic process such as a Markov model, and programs for determining the coding regions by using these extracted characteristics have been developed, for example GenMark [Borodovsky, M. & McIninch, J.: Computers Chem. Vol. 17, 123-133 (1993)], GenMark.hmm [Lukashin, A. V. & Borodovsky, M.: Nucleic Acids Research, Vol. 26, 1107-1115 (1998), and Besemer, J. & Borodovsky, M.: Nucleic Acids Research Vol. 27, 3911-3920 (1999)], and Glimmer [Salzberg, S. et al.: Nucleic Acids Research Vol. 26, p. 544-548 (1998), and Delcher, A. L. et al.: Nucleic Acids Research Vol. 27, p. 4636-4641 (1999)]. Among these, Glimmer is the most widely used program in the world. Furthermore, the program CRITICA [Badger, J. H. & Losen, G. J. et al.: Mol. Biol. Evol. Vol. 16, p. 512-524 (1999)], which determines the coding regions based upon homology analysis, and the program ORPHELUS [Frishman, M. et al.: Nucleic Acids Research Vol. 26, p. 2941-2947 (1998)], which determines the coding regions based upon the existence of ribosome binding sequences and codon usage analysis, were developed. However, the accuracy of coding region determination of these programs is not sufficiently high, and it is desired to develop a technique for determining coding regions with higher accuracy. Furthermore, very recently, the program GenMarkS [Besemer, J., Lomsadze, A. & Borodovsky, M.: Nucleic Acids Research Vol. 29, 2607-2618 (2001)], which determines the coding regions at high accuracy by enhancing the accuracy of determination of the translation start codons, was also developed. In combination with the above described GenMark.hmm, this program determined the coding regions of 8 microbial genomes with a sensitivity of 0.969 or more and a specificity of 0.865 or more. However, a program with even higher specificity is required, and there is also a great demand for the development of a single program which has high accuracy in both sensitivity and specificity.

Up till now, the living organism which has been analyzed in the greatest detail by various experimental methods is the Escherichia coli K-12 strain. The American study group and the Japanese study group who analyzed the Escherichia coli genome announced its total number of CDSs to be 4289 and 4359, respectively. When the Escherichia coli genome is analyzed using Glimmer, the most often utilized CDS determination program, it is possible to find about 4158 of the 4289 CDSs. On the other hand, the total number of CDSs predicted by Glimmer is 5026, which greatly differs from 4289 [Delcher, A. L. et al.: Nucleic Acids Research Vol. 27, p. 4636-4641 (1999)]. Also, in the actual process of annotation of a microbial genome sequence, it takes a long time to examine closely the coding regions which are tentatively determined by using these CDS determination programs in order to determine a genetic structure. In this process of annotation, there is a strong demand for the development of a technique which has higher accuracy and can shorten the time for annotation.
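As a worked illustration of the definitions of sensitivity and specificity, the following minimal Python sketch applies them to the Glimmer figures quoted in this paragraph (N_(T) = 4289, N_(S) = 5026, and N_(TP) taken as the approximate value 4158). The function name is hypothetical and is introduced only for this illustration.

```python
def sensitivity_specificity(n_true, n_predicted, n_true_positive):
    """Return (sensitivity, specificity) as defined in the text.

    sensitivity = N_TP / N_T, specificity = N_TP / N_S.
    """
    if n_true <= 0 or n_predicted <= 0:
        raise ValueError("N_T and N_S must be positive")
    return n_true_positive / n_true, n_true_positive / n_predicted


# Figures quoted for Glimmer on the Escherichia coli genome.
sens, spec = sensitivity_specificity(4289, 5026, 4158)
print(f"sensitivity = {sens:.3f}, specificity = {spec:.3f}")
```

With these figures the sensitivity is about 0.969 and the specificity is about 0.827, which illustrates why a large number of excess predictions lowers the specificity even when almost all true CDSs are found.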

The existing programs for determination of the coding regions of prokaryotes have the problem that the accuracy of determination of the coding regions is insufficient in practice, in particular because the accuracy of prediction of the position of the translation start codon (hereinafter referred to as the start codon) is low. Existing programs also have the problem that they cannot predict the structure of polycistronic transcription units. Furthermore, when existing programs for determination of the coding regions are utilized, a preprocess is required before the execution of the programs, since correct coding regions must be determined in advance by homology analysis and the like, and the programs use a method of prediction based upon the information about these coding regions. Due to this, they have the problem that the processes are complicated and the time required to determine the coding regions is long. Furthermore, most of the programs for determining the coding regions which have been developed up till now have the problem that the accuracy of the determination of the coding regions becomes low when the coding regions are predicted from a nucleotide sequence in which the content of G residues and C residues (hereinafter abbreviated as “GC content”) is high. Even the combination of the two previously described programs GenMarkS and GenMark.hmm cannot determine the coding regions at high accuracy from a microbial genome with a high GC content (for example, 65% or more).

In order to show these problems specifically, the genetic structure of a prokaryote is explained with reference to FIGS. 1 and 2. As shown in FIG. 1, the basic structure of a prokaryote gene consists of a promoter which starts the synthesis of mRNA, a ribosome binding site which participates in the binding between mRNA and ribosomes and in translation initiation, a start codon, a translation stop codon (hereinafter referred to as a stop codon), and a terminator which terminates the synthesis of mRNA. The AUG codon is the most appropriate start codon, and the GUG codon is the next most used. The UUG codon is used rarely, and it has been reported that the CUG codon is used very rarely.

Since start codons and coding regions are usually determined based upon a DNA sequence, in the present specification the sequences of start codons and stop codons and the sequences involved in the binding of ribosomes and mRNA are expressed as DNA sequences as well as RNA sequences, as appropriate, unless specifically mentioned otherwise. The ribosome binding site is also called the Shine-Dalgarno sequence (hereinafter abbreviated as SD sequence), and generally has a sequence complementary to the 3′ terminal of 16S rRNA. In particular, the SD sequence which appears at high frequency is AGGAGG or AAGGAGG (hereinafter these sequences are referred to as “consensus SD sequences”), or a sequence homologous with a “consensus SD sequence”. Although these sequences appear at various sites of genes, it is understood that SD sequences appear at high frequency in regions upstream of start codons. In the present specification, in order to specify a “position upstream of a start codon” clearly, the distance between AGGA (or a sequence which is positionally identical with AGGA) within a consensus SD sequence (AGGAGG or AAGGAGG) and a start codon (hereinafter referred to as the distance between SD-ATG) is used in the following description. It is known that the distance between SD-ATG exerts a great influence upon the translation initiation efficiency of genes [Shepard, H. M. et al.: DNA Vol. 2, p. 125-131 (1982), Itoh, S. et al.: DNA Vol. 2, p. 157-165 (1982)].

Next, as shown in FIG. 2, many translational reading frames which have the possibility to encode proteins (Open Reading Frames: hereinafter abbreviated as “ORF”s) are present. In the present specification, a translational reading frame which has the possibility to encode a protein is termed an “ORF”, and, among the ORFs, a coding region from which a protein is actually translated is termed a “CDS” (coding sequence). Furthermore, an individual ORF, CDS, coding region or transcription unit is expressed with an added number or symbol, as in ORF-1, ORF-A, ORF-B, CDS-1, CDS-A, CDS-B, coding region A, coding region B, transcription unit P, transcription unit Q, and the like. In the present specification, ORF means the longest ORF in many cases. In the present specification, as shown in FIG. 2, the DNA strand on the upper side of the figure, in other words the DNA strand upon which genes translated rightwards are present tandemly, is termed the plus strand, while the DNA strand on the lower side of the figure is termed the minus strand.

In many cases, as shown in FIG. 2, a CDS (a region which actually encodes a protein) is identical with an ORF, or is a portion of an ORF. Furthermore, many candidates for start codons also exist, and it is not easy to determine the true start codon. In the case of the genes of a prokaryote, a plurality of CDSs exists upon the same mRNA, and moreover the individual CDSs, in other words cistrons, are adjacent and form a linked structure. This structure of the transcription unit is termed “polycistronic mRNA”. It is difficult to predict the transcription unit structure of polycistrons with the approaches of information science, and no program for determining translated regions which can predict the transcription units with good efficiency and can output the results thereof has been developed. Furthermore, if a CDS or transcription unit overlaps with another CDS or transcription unit, or contains another CDS or transcription unit (in the present specification, expressed as “include” or “be in an inclusion relationship”), it is necessary to decide on the truth or the falsity of each of the CDSs or transcription units.

As described above, there is the problem that it is difficult to determine the correct CDSs for given nucleotide sequence information, due to reasons such as “the existence of many candidates for CDS”, “the existence of many candidates for the start codon”, and “the overlapping of two or more candidates for CDS”.

DISCLOSURE OF THE INVENTION

The present invention has been made in order to solve the above described problems. Its goal is to develop a method of determining a genetic structure with enhanced accuracy, of which the chief aims are: to make it possible to predict the structure of polycistronic transcription units; to enhance the accuracy of determination of the positions of start codons; to eliminate the need to provide information in advance; and to handle nucleotide sequences whose GC content is high. Thus, the objective of the present invention is to provide a method of determining a genetic structure of a prokaryote which achieves these aims, a program for executing the method, a computer-readable recording medium on which the program is recorded, and a system for determining a genetic structure based upon a computer holding the recording medium.

A method of determining a genetic structure of the present invention is directed to a method of determining at least one of the following members: a coding region, the position of a translation start codon, the position of a translation stop codon, and a transcription unit.

The present inventor has principally found a “method of determining a genetic structure from a viewpoint of a transcription unit structure”, a “method of determining a genetic structure using a shadow discrimination function”, and a “method of determining a genetic structure from a viewpoint of the GC content of the bases in the codons”, has produced programs for executing these methods on a computer, and has found that it is possible to attain these goals by executing the programs.

In other words, the present invention is a “method of determining a genetic structure from the viewpoint of a transcription unit structure” according to (1)-(6) below.

(1) A method of determining a genetic structure of a prokaryote, which comprises the steps (a) to (g) described below:

(a) setting a translation stop codon from information about the nucleotide sequence of a prokaryote (a nucleotide sequence is a sequence of DNA or RNA), and setting a provisional translation start codon which yields the longest open reading frame (hereinafter abbreviated as ORF) based upon said translation stop codon;

(b) deciding that the ORF-A and the ORF-B have a possibility to form a single transcription unit if the provisional start codon of the ORF-A is upstream of the translation stop codon of the ORF-B, or is within D_(S) bases downstream of said translation stop codon [herein D_(S) is an integer from 20 to 100], wherein any two neighboring ORFs which are obtained in the step (a) and present on the same strand are termed ORF-A and ORF-B from downstream;

(c) determining that the candidate for the translation start codon is the translation start codon of the ORF-A if the ORF-A and the ORF-B are decided to have a possibility to form a single transcription unit in the step (b) and if the candidate for the translation start codon is present within a region (hereinafter termed the “vicinity of the translation stop codon”) between D_(B) bases downstream from the first T (thymidine) residue of the translation stop codon of the ORF-B and U_(B) bases upstream from said T residue [herein D_(B) is an integer between 10 and 20, and U_(B) is an integer between 3 and 15], and determining the translation start codon of the ORF-A from a priority ranking determined by using the distance between each candidate and the translation stop codon of the ORF-B as an indicator if there is a plurality of candidates;

(d) examining whether a candidate for the translation start codon of the ORF-A is present within a region (hereinafter termed the “region around the vicinity of the translation stop codon”) between R_(D) bases downstream from the first T residue of the translation stop codon of the ORF-B and R_(U) bases upstream from said T residue and excluding said “vicinity of the translation stop codon” [herein R_(D) is an integer from 30 to 120, and R_(U) is an integer from 20 to 120] if the translation start codon of the ORF-A cannot be determined in the step (c);

(e) examining whether a ribosome binding site is present from 1 to 30 bases upstream of a candidate for the translation start codon of the ORF-A if the candidate is present in the region around the vicinity of the translation stop codon in the step (d), determining its ribosome binding sequence if such a ribosome binding site is present, and determining that the candidate which corresponds to said ribosome binding sequence is the translation start codon of the ORF-A;

(f) searching for up to the number N of candidates for the translation start codon, including the provisional start codon which yields the longest ORF, from the 5′ terminal of an ORF-A which is not decided to have a possibility to form a single transcription unit in the step (b) or whose translation start codon is not determined in the step (e), investigating whether a ribosome binding site is present from 1 to 30 bases upstream of each candidate, determining its ribosome binding sequence if such a ribosome binding site is present, and determining that the candidate which corresponds to said ribosome binding sequence is the translation start codon [herein N is an integer from 5 to 20];

(g) confirming the positions of the translation start codon and the translation stop codon, the coding region, and the transcription units from the results of determination by the step (c), the step (e) or the step (f) to determine a genetic structure.
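Steps (a) and (b) can be sketched in Python as follows. This is a simplified illustration for the plus strand only, under assumptions which are not stated in the text: positions are 0-based indices of the first base of a codon, ATG, GTG and TTG are accepted as start codon candidates, and the distance in step (b) is measured from the first base of the stop codon of the ORF-B to the first base of the provisional start codon of the ORF-A. The function names and the default value of `d_s` are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODONS = {"ATG", "GTG", "TTG"}  # assumed set of start codon candidates


def longest_orfs(seq):
    """Step (a), plus strand only: for each in-frame stop codon, pair it with
    the most upstream start candidate that follows the previous in-frame stop.

    Returns sorted (start, stop) pairs of 0-based first-base positions.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        first_start = None
        for pos in range(frame, len(seq) - 2, 3):
            codon = seq[pos:pos + 3]
            if codon in STOP_CODONS:
                if first_start is not None:
                    orfs.append((first_start, pos))
                first_start = None
            elif first_start is None and codon in START_CODONS:
                first_start = pos
    return sorted(orfs)


def may_form_transcription_unit(start_a, stop_b, d_s=50):
    """Step (b): ORF-A (downstream) and ORF-B (upstream) may share a
    transcription unit if the provisional start of ORF-A lies upstream of the
    stop codon of ORF-B, or at most d_s bases downstream of it.
    d_s stands for D_S (an integer from 20 to 100); 50 is an arbitrary choice.
    """
    return start_a - stop_b <= d_s


# Two short ORFs separated by two bases.
orfs = longest_orfs("ATGAAATAACCATGCCCTGA")
(start_b, stop_b), (start_a, stop_a) = orfs
print(orfs, may_form_transcription_unit(start_a, stop_b))
```

In this sketch the upstream case of step (b) is covered automatically, because the difference `start_a - stop_b` is then negative.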

(2) The method of determining a genetic structure according to (1), wherein the step (e) is a step of determining the translation start codon of an ORF-A by the following steps:

determining that an mRNA sequence whose ribosome binding score exceeds a threshold value V₃, described below, is a ribosome binding sequence [herein V₃ is an integer from 4 to 12], wherein the paired state between an mRNA sequence of 4 to 17 bases upstream of a candidate for the translation start codon of the ORF-A and a sequence (3′-UUCCUCC-5′) involved in the binding to mRNA in a 16S rRNA 3′ terminal sequence, or between an mRNA sequence of 4 to 16 bases upstream of said candidate and a sequence (3′-UCCUCC-5′) involved in the binding to mRNA in a 16S rRNA 3′ terminal sequence, is expressed as a numerical value, which is termed a “score which shows the binding state between mRNA and a ribosome” (hereinafter termed a ribosome binding score), according to the four rules described below:

(i) A pairing of G and C yields +4;

(ii) A pairing of A and U yields +2;

(iii) A pairing of G and U yields +1;

(iv) When no pairing is present at a base pair which is adjacent to a base pair where a pairing is present, then this yields −1;

determining that the candidate which corresponds to said ribosome binding sequence is the translation start codon;

dividing the “region of an ORF-B around the vicinity of the stop codon” into the two regions of “the region downstream of said vicinity” and “the region upstream of said vicinity” if there is a plurality of said translation start codons, and determining that the one of said translation start codons which has the highest priority is the true translation start codon, based on the priority of “the region downstream of said vicinity” and “the region upstream of said vicinity” in that order;

determining the translation start codon of the ORF-A from a priority ranking defined by using the distance from the translation stop codon of the ORF-B as an indicator if a plurality of translation start codons is present within the respective regions.
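One possible reading of the four scoring rules is sketched below in Python for the 7-base sequence 3′-UUCCUCC-5′ only. The sketch assumes that the 16S rRNA sequence is slid along the upstream window without gaps, that the best alignment gives the score, and that rule (iv) deducts 1 for every unpaired position which has at least one paired neighbor. These choices, and all function names, are assumptions made for illustration and are not taken from the text.

```python
ANTI_SD = "UUCCUCC"  # 16S rRNA 3' terminal sequence, written 3'->5'

PAIR_SCORES = {
    frozenset("GC"): 4,  # rule (i)
    frozenset("AU"): 2,  # rule (ii)
    frozenset("GU"): 1,  # rule (iii)
}


def alignment_score(mrna_window, anti_sd=ANTI_SD):
    """Score one gapless alignment of an mRNA window (5'->3') with anti_sd."""
    paired = [PAIR_SCORES.get(frozenset((m, r)), 0)
              for m, r in zip(mrna_window, anti_sd)]
    score = sum(paired)
    for i, value in enumerate(paired):
        if value == 0:
            left = i > 0 and paired[i - 1] > 0
            right = i + 1 < len(paired) and paired[i + 1] > 0
            if left or right:
                score -= 1  # rule (iv), as interpreted in this sketch
    return score


def upstream_window(seq, start, near=4, far=17):
    """Bases located from `near` to `far` bases upstream of the start codon
    whose first base is at 0-based index `start`."""
    return seq[max(start - far, 0):max(start - near + 1, 0)]


def ribosome_binding_score(upstream, anti_sd=ANTI_SD):
    """Best alignment score over the window; DNA input is read as RNA."""
    rna = upstream.upper().replace("T", "U")
    n = len(anti_sd)
    if len(rna) < n:
        return 0
    return max(alignment_score(rna[i:i + n], anti_sd)
               for i in range(len(rna) - n + 1))


seq = "CCCAAGGAGGCCCC" + "ACT" + "ATGAAA"
print(ribosome_binding_score(upstream_window(seq, 17)))
```

Under this reading, the consensus SD sequence AAGGAGG pairs perfectly with 3′-UUCCUCC-5′ and yields the maximum score of 22, which can then be compared with the threshold value V₃.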

(3) The method of determining a genetic structure according to (1) or (2), wherein the step (f) is a step of determining the translation start codon of an ORF-A by the following steps:

determining that the mRNA sequence whose ribosome binding score exceeds a threshold value V₁, described below, is a ribosome binding sequence, wherein the paired state between an mRNA sequence of from 4 to 17 bases upstream of a candidate for the translation start codon of the ORF-A and a sequence (3′-UUCCUCC-5′) involved in the binding to mRNA in a 16S rRNA 3′ terminal sequence, or between an mRNA sequence of 4 to 16 bases upstream of said candidate and a sequence (3′-UCCUCC-5′) involved in the binding to mRNA in a 16S rRNA 3′ terminal sequence, is expressed as a numerical value, termed “ribosome binding score”, according to the four rules described below:

(i) A pairing of G and C yields +4;

(ii) A pairing of A and U yields +2;

(iii) A pairing of G and U yields +1;

(iv) When no pairing is present at a base pair which is adjacent to a base pair where a pairing is present, then this yields −1;

determining that the candidate which corresponds to said ribosome binding sequence is the translation start codon;

determining that the translation start codon corresponding to the ribosome binding sequence which yields the highest score is the true translation start codon if there is a plurality of said translation start codons;

setting one or more threshold value(s) smaller than V₁, which include the threshold value V₃, if there is no candidate which exceeds the threshold value V₁, and determining the translation start codon of the ORF-A in a stepwise manner if said threshold value is exceeded [herein V₁ is an integer which is greater than the V₃ of (2), and which is between 7 and 14].

(4) The method of determining a genetic structure according to (2) or (3), wherein the “ribosome binding score” is calculated by deducting a numerical value P_(G) if the translation start codon is GTG, or by deducting a numerical value P_(T) if the translation start codon is TTG [herein P_(G) is an integer from 1 to 4, and P_(T) is an integer from 2 to 6].
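The stepwise use of thresholds in (3) and the start codon deduction in (4) can be combined as in the following Python sketch. The concrete values (P_(G) = 2, P_(T) = 4, V₁ = 10, V₃ = 6) are arbitrary choices within the stated ranges, and the function names, the input format, and the decision to report the threshold at which a candidate was accepted are assumptions made for illustration.

```python
# Illustrative deductions: P_G = 2 (range 1-4), P_T = 4 (range 2-6).
START_PENALTY = {"ATG": 0, "GTG": 2, "TTG": 4}


def adjusted_score(raw_score, start_codon, penalties=START_PENALTY):
    """Ribosome binding score after the deduction described in (4)."""
    return raw_score - penalties.get(start_codon.upper(), 0)


def choose_start(candidates, thresholds=(10, 6)):
    """Pick a start codon in a stepwise manner.

    candidates: iterable of (position, start_codon, raw_ribosome_binding_score)
    thresholds: V1 and the smaller threshold(s) down to V3 (here 10 and 6)

    Returns (position, threshold_used), or None if no candidate exceeds
    even the smallest threshold.
    """
    scored = [(pos, adjusted_score(raw, codon)) for pos, codon, raw in candidates]
    for threshold in sorted(thresholds, reverse=True):
        passing = [item for item in scored if item[1] > threshold]
        if passing:
            best = max(passing, key=lambda item: item[1])
            return best[0], threshold
    return None


print(choose_start([(100, "ATG", 9), (130, "GTG", 13), (160, "TTG", 12)]))
```

Returning the threshold together with the position keeps a record of how strong the supporting ribosome binding sequence was, which may be useful when the results are examined during annotation.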

(5) A method of determining a genetic structure, wherein a transcription unit P, a coding region A, a transcription unit Q, and a coding region B are determined by utilizing the method according to any one of (1) to (4), which further comprises the steps (h) to (j) described below if the transcription unit P or the coding region A overlaps with the transcription unit Q or the coding region B:

(h) deciding that the transcription unit Q or the coding region B is a “false transcription unit” or a “false coding region” if a transcription unit Q or a coding region B which is present upon the same strand as a transcription unit P or a coding region A is included in the transcription unit P or the coding region A;

(i) deciding that the transcription unit Q or the coding region B is a “false transcription unit” or a “false coding region” if a transcription unit Q or a coding region B which is present upon the complementary strand to a transcription unit P or a coding region A is included in the transcription unit P or the coding region A;

(j) deciding that the transcription unit or coding region whose length is shorter is a “false transcription unit” or a “false coding region” when a transcription unit P or a coding region A overlaps with a transcription unit Q or a coding region B which is present upon the complementary strand.
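The decisions of the steps (h) to (j) can be sketched in Python as follows. The sketch assumes half-open 0-based coordinates on a common axis for both strands, treats transcription units and coding regions alike as a hypothetical `Region` record, and leaves undecided the cases which the steps do not mention (a partial overlap on the same strand, or two overlapping regions of equal length).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Region:
    name: str
    start: int   # inclusive, 0-based
    end: int     # exclusive
    strand: str  # "+" or "-"

    @property
    def length(self):
        return self.end - self.start


def false_region(p: Region, q: Region) -> Optional[Region]:
    """Return the region judged false by steps (h)-(j), or None if undecided."""
    if not (p.start < q.end and q.start < p.end):
        return None  # no overlap, nothing to decide
    # Steps (h) and (i): an included region is false on either strand.
    if p.start <= q.start and q.end <= p.end:
        return q
    if q.start <= p.start and p.end <= q.end:
        return p
    # Step (j): partial overlap on complementary strands, the shorter is false.
    if p.strand != q.strand and p.length != q.length:
        return p if p.length < q.length else q
    return None


P = Region("P", 0, 900, "+")
print(false_region(P, Region("Q", 100, 400, "-")))
```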

(6) A method of determining a genetic structure, wherein the method of determining a genetic structure according to any one of (1) to (5) is utilized repeatedly.

Furthermore, the present invention is a “method of determining a genetic structure using a shadow discrimination function” as described in (7)-(11) below.

(7) A method of determining a genetic structure of a prokaryote, which comprises the steps (k) and (l) described below:

(k) selecting k types of combination of codons wherein “the frequency of appearance of one codon is high and the frequency of appearance of a codon which has the complementary sequence to the 3-base sequence of said codon is low” in a plurality (the number T) of determined coding regions of the prokaryote;

(l) comparing the “number of times of the k types of codons whose frequency of appearance is high appearing in a coding region A which is assumed to be a coding region” with the “number of times of the k types of codons whose frequency of appearance is low appearing in said coding region A”, and deciding on the truth or falsity of said coding region A [herein k is an integer greater than or equal to 5 and less than or equal to 20].

(8) The method of determining a genetic structure according to (7), wherein the method for comparing the “number of times of the k types of codons whose frequency of appearance is high appearing in a coding region A which is assumed to be a coding region” with the “number of times of the k types of codons whose frequency of appearance is low appearing in said coding region A” is a method which involves using “the reciprocal of the sum of 1 and the ratio of the number of the latter to the number of the former” as a calculation formula and which involves deciding that said coding region A is a “false coding region” if the value of said reciprocal is less than a fixed value.
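The calculation formula of (8) can be written as 1 / (1 + low / high), where high and low denote the two counts being compared. The following Python sketch evaluates it in the algebraically equivalent form high / (high + low), which also covers the case in which high is zero and low is positive. The function name and the handling of the case in which both counts are zero (no value is returned) are assumptions made for illustration.

```python
def reciprocal_score(high, low):
    """Reciprocal of (1 + low / high), computed as high / (high + low).

    high: occurrences of the k high-frequency codons in the candidate region
    low:  occurrences of the k low-frequency codons in the same region
    Returns None when neither kind of codon occurs.
    """
    if high < 0 or low < 0:
        raise ValueError("counts must be non-negative")
    if high + low == 0:
        return None
    return high / (high + low)


# 30 high-frequency codons and 10 low-frequency codons.
print(reciprocal_score(30, 10))
```

A candidate region read in the wrong orientation tends to contain more low-frequency codons than high-frequency codons, so its value falls towards 0, while a correctly oriented region gives a value towards 1.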

(9) The method of determining a genetic structure according to (7), which is based on the nucleotide sequence of the number T of determined coding regions of the prokaryote and comprises the steps (m) to (p) described below:

(m) arranging the 64 types of codons so that the 3-base sequence of the i-th codon has the complementary sequence to the nucleotide sequence of the (i+32)-th codon;

(n) obtaining y_(i) from the formula (2) below and y_(i+32) from the formula (3) below:

$$y_{i} = \left( \sum_{t=1}^{T} C_{i}^{t} - \sum_{t=1}^{T} C_{i+32}^{t} \right) \Big/ \sum_{t=1}^{T} \sum_{j=1}^{64} C_{j}^{t} \qquad (2)$$

$$y_{i+32} = \left( \sum_{t=1}^{T} C_{i+32}^{t} - \sum_{t=1}^{T} C_{i}^{t} \right) \Big/ \sum_{t=1}^{T} \sum_{j=1}^{64} C_{j}^{t} \qquad (3)$$

wherein the number of appearances of the i-th codon in the t-th coding region is expressed as $C_{i}^{t}$;

(o) rearranging the 64 types of codons in the step (m) in descending order of the y_(i) and the y_(i+32), selecting the top k types of codons for which the value of y_(i) or of y_(i+32) is large, and obtaining the value of Sd_(A) for a coding region A by the following formula (4):

$$Sd_{A} = 2 \times \sum_{i=1}^{k} C_{i}^{A} \Big/ \left( \sum_{i=1}^{k} C_{i}^{A} + \sum_{i=65-k}^{64} C_{i}^{A} \right) \qquad (4)$$

[herein the value of Sd_(A) is defined as 1 if $\left( \sum_{i=1}^{k} C_{i}^{A} + \sum_{i=65-k}^{64} C_{i}^{A} \right)$ is zero];

(p) deciding that a coding region A is a true coding region if the value of Sd_(A) of said coding region calculated in the step (o) is greater than or equal to a threshold value S₁, and that it is a false coding region if said value of Sd_(A) is less than the threshold value S₁ [herein T is an integer greater than or equal to 2, i is a positive integer less than or equal to 32, j is a positive integer less than or equal to 64, t is a positive integer less than or equal to T, k is an integer from 5 to 20, and S₁ is a value from 0.8 to 1.8].
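The steps (m) to (p) can be sketched in Python as follows. Instead of numbering the codons from 1 to 64, the sketch pairs each codon directly with its reverse complement, which is assumed here to be what “complementary sequence” means for a codon read on the opposite strand. The function names, the default values of `k` and `s1`, and the exclusion of codons whose y value is not positive are assumptions made for illustration.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def codon_counts(cds):
    """Count in-frame codons of a coding region (DNA, 5'->3')."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    return Counter(cds[i:i + 3] for i in range(0, usable, 3))


def select_high_codons(training_cds, k=10):
    """Steps (m)-(o): rank codons by y = (count - count of shadow) / total
    over all training coding regions and keep the top k with positive y."""
    counts = Counter()
    for cds in training_cds:
        counts.update(codon_counts(cds))
    total = sum(counts.values())
    if total == 0:
        return []
    y = {c: (counts[c] - counts[reverse_complement(c)]) / total for c in counts}
    ranked = sorted((c for c in y if y[c] > 0), key=lambda c: y[c], reverse=True)
    return ranked[:k]


def shadow_score(cds, high_codons):
    """Formula (4): Sd = 2 * high / (high + low); defined as 1 if both are 0."""
    counts = codon_counts(cds)
    high = sum(counts[c] for c in high_codons)
    low = sum(counts[reverse_complement(c)] for c in high_codons)
    if high + low == 0:
        return 1.0
    return 2 * high / (high + low)


def is_true_coding_region(cds, high_codons, s1=1.0):
    """Step (p): true if Sd >= S1 (S1 is a value from 0.8 to 1.8)."""
    return shadow_score(cds, high_codons) >= s1


high = select_high_codons(["GCTGCTGAA", "GCTAGCGAA"], k=5)
candidate = "GCTGCTGAAAGCTTT"
print(shadow_score(candidate, high), shadow_score(reverse_complement(candidate), high))
```

In the usage example the candidate scores 1.5, while the same region read from the complementary strand scores 0.5, which is the asymmetry that the threshold value S₁ exploits.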

(10) A method of determining a genetic structure of a prokaryote, which comprises the steps (q) and (r) described below, wherein a coding region of the prokaryote or a coding region A which is assumed to be a coding region overlaps with a coding region B which is assumed to be a coding region and present upon the complementary strand, and said coding region B is included in said coding region A:

(q) comparing the length L_(B) (in base pairs) of said coding region B with the length L_(A) (in base pairs) of said coding region A, and deciding that said coding region B is a “false coding region” if L_(B) is less than or equal to T_(P) % of L_(A);

(r) deciding on the truth or falsity of said coding region A and of said coding region B by the method according to any one of (7) to (9) if L_(B) exceeds T_(P) % of L_(A) [herein, T_(P) is a positive integer from 30 to 95].
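The two-stage decision of the steps (q) and (r) can be sketched in Python as follows. The sketch assumes that the Sd values of the two regions have already been calculated by formula (4), and it uses T_(P) = 60 and S₁ = 1.2 as arbitrary values within the stated ranges. The function name and the convention of returning `None` for a region which is not judged at this stage are assumptions made for illustration.

```python
def judge_included_pair(len_a, len_b, sd_a, sd_b, t_p=60, s1=1.2):
    """Judge a coding region A and a region B included in A on the other strand.

    Returns (verdict_a, verdict_b): True for a true coding region, False for
    a false coding region, None if the region is not judged by this step.
    """
    # Step (q): B is false if L_B <= T_P % of L_A (integer arithmetic).
    if len_b * 100 <= t_p * len_a:
        return None, False
    # Step (r): otherwise both regions are judged by the Sd threshold.
    return sd_a >= s1, sd_b >= s1


print(judge_included_pair(900, 300, 1.5, 0.4))
print(judge_included_pair(900, 800, 1.5, 0.4))
```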

(11) A method of determining a genetic structure, characterized by removing the translation stop codons from the coding regions which form a transcription unit, and linking up the resulting coding regions into a single coding region, before utilizing the method according to any one of (7) to (10).
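The linking operation of (11) can be sketched in Python as follows, assuming that each coding region is given as a DNA string in its own reading frame and ends with its translation stop codon. The function name is hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def link_coding_regions(regions):
    """Remove the terminal stop codon of each coding region of a transcription
    unit and join the remainders into a single in-frame coding region."""
    parts = []
    for cds in regions:
        cds = cds.upper()
        if cds[-3:] in STOP_CODONS:
            cds = cds[:-3]
        parts.append(cds)
    return "".join(parts)


print(link_coding_regions(["ATGAAATAA", "ATGCCCTGA"]))
```

Because every coding region has a length which is a multiple of three, the linked sequence keeps the codons of all regions in the same frame, so the codon counts used in (7) to (10) can be taken over the whole transcription unit at once.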

Furthermore, the present invention, as specified by (12) below, enhances the accuracy of determination of the coding regions of a prokaryote by combining a “method of determining a genetic structure from the viewpoint of a transcription unit structure” and a “method of determining a genetic structure using a shadow discrimination function”.

(12) A method of determining a genetic structure, which comprises:

deciding on the truth or falsity of a coding region or of a transcription unit which is determined by the method of determining a genetic structure according to any one of (1) to (6), by utilizing the method of determining a genetic structure according to any one of (7) to (11).

Furthermore, the present invention is specified by (13) below.

(13) A method of determining a genetic structure, which comprises:

deciding on the truth or falsity of a coding region which encodes a polypeptide of L_(M) amino acids or more in length, by using the method of determining a genetic structure according to any one of (7) to (12), based on the nucleotide sequence of a coding region which is determined by using the method of determining a genetic structure according to any one of (1) to (12) and which encodes a polypeptide of L_(F) amino acids or more in length [herein L_(F) is a positive integer greater than or equal to 100, and L_(M) is a positive integer greater than or equal to 20].

Furthermore, the present invention is a “method of determining a genetic structure from the viewpoint of the GC content of the bases in the codons”, as shown in (14) to (18) below.

(14) A method of determining a genetic structure of a prokaryote, characterized by deciding that a coding region in the nucleotide sequence of the prokaryote is a “false coding region” if the GC content of said nucleotide sequence is greater than 50% and if a content, calculated by utilizing a calculation formula which yields the content of G residues and C residues at the first and third positions of the codons in said nucleotide sequence, is less than a fixed value.

(15) The method of determining a genetic structure according to (14), wherein the following formula (5) is used as a calculation formula, the value of GC_(i) described below is used as a calculated content, and one value which is selected from 0.6 to 0.75 is used as a fixed value:

$GC_{i} = \frac{y^{i(1)} + y^{i(3)}}{\sum_{r=1}^{3} y^{i(r)}}, \quad \text{wherein} \quad y^{i(r)} = \sum_{n=1}^{N_{i}} \sum_{b=1}^{4} x_{n(b)}^{i(r)} \qquad (5)$

[herein when the r-th base (r is 1, 2, or 3) of the n-th codon of the i-th coding region is b (b is 1, 2, 3, or 4), then $x_{n(b)}^{i(r)} = \begin{cases} 1 & (b = 1 \text{ or } 2) \\ 0 & (b = 3 \text{ or } 4) \end{cases}$

and, as for b, when the r-th base of the n-th codon of the i-th coding region is G, C, A, or T, b is 1, 2, 3, or 4, respectively, i and n are positive integers, and N_(i) denotes the total number of the codons (excluding the translation stop codon) of the i-th coding region].
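As an illustration of formula (5), the following minimal Python sketch counts the G and C residues at each codon position and returns GC_(i). It is a sketch under stated assumptions, not the implementation of the invention: the function names `gc13_content` and `is_false_by_gc`, the default fixed value 0.7 (one value inside the range 0.6 to 0.75), the stop codon set TAA/TAG/TGA, and the return value 0.0 for a region with no G or C at all are hypothetical choices made for the example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # assumption: standard bacterial stop codons


def gc13_content(coding_region: str) -> float:
    """GC_i of formula (5): share of the region's G/C residues at codon positions 1 and 3."""
    seq = coding_region.upper()
    seq = seq[: len(seq) - len(seq) % 3]   # keep whole codons only
    if seq[-3:] in STOP_CODONS:            # N_i excludes the translation stop codon
        seq = seq[:-3]
    y = [0, 0, 0]                          # y^{i(1)}, y^{i(2)}, y^{i(3)}
    for index, base in enumerate(seq):
        if base in "GC":                   # x = 1 for G or C, x = 0 for A or T
            y[index % 3] += 1
    total = sum(y)
    if total == 0:
        return 0.0                         # assumption: a region without G/C scores 0
    return (y[0] + y[2]) / total


def is_false_by_gc(coding_region: str, genome_gc: float, fixed_value: float = 0.7) -> bool:
    """Apply the decision only when the genome GC content exceeds 50%."""
    return genome_gc > 0.5 and gc13_content(coding_region) < fixed_value
```

For example, in `GCCGACCTG` six of the seven G/C residues sit at the first or third codon position, so the region is not rejected at a fixed value of 0.7.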

(16) A method of determining a genetic structure of a prokaryote, which comprises:

deciding that a coding region in the nucleotide sequence of the prokaryote is a “false coding region” if the GC content of said nucleotide sequence is greater than 50%, and if a content, calculated by utilizing a calculation formula which yields a content of the first and third G residues and C residues of the codons in said nucleotide sequence, is less than a fixed value; and

re-searching for a translation start codon which is present downstream of said translation start codon which is decided to be false.

(17) The method of determining a genetic structure according to (16), wherein the following formula (5) is used as a calculation formula, the value of GC_(i) described below is used as a calculated content, and one value which is selected from 0.6 to 0.75 is used as a fixed value:

$GC_{i} = \frac{y^{i(1)} + y^{i(3)}}{\sum_{r=1}^{3} y^{i(r)}}, \quad \text{wherein} \quad y^{i(r)} = \sum_{n=1}^{N_{i}} \sum_{b=1}^{4} x_{n(b)}^{i(r)} \qquad (5)$

[herein, when the r-th base (r is 1, 2, or 3) of the n-th codon of the i-th coding region is b (b is 1, 2, 3, or 4), then $x_{n(b)}^{i(r)} = \begin{cases} 1 & (b = 1 \text{ or } 2) \\ 0 & (b = 3 \text{ or } 4) \end{cases}$

and, as for b, when the r-th base of the n-th codon of the i-th coding region is G, C, A, or T, b is 1, 2, 3, or 4, respectively, i and n are positive integers, and N_(i) denotes the total number of the codons (excluding the translation stop codon) of the i-th coding region].

(18) A method of determining a genetic structure of a prokaryote whose GC content in the nucleotide sequence exceeds 50%, wherein the method of determining a genetic structure according to any one of (1) to (13) and the method of determining a genetic structure according to any one of (14) to (17) are utilized.

Furthermore, the present invention is specified by (19) to (24) below.

(19) A method of determining a genetic structure of a prokaryote, wherein the method of determining a genetic structure according to any one of (1) to (18) and a “method of deciding on the truth or falsity of a coding region by utilizing a coding potential” are utilized.

(20) The method of determining a genetic structure according to (19), wherein said “method of deciding on the truth or falsity of a coding region by utilizing a coding potential” is a method of deciding on the truth or falsity of the coding region A described below by, based upon the nucleotide sequences of the number T of the determined coding regions of the prokaryote, comparing the “number of times of m types of codons whose frequency of appearance is high appearing in the coding region A which is assumed to be the coding region” with the “number of times of m types of codons whose frequency of appearance is low appearing in the coding region A” for the number T of coding regions [herein, T is an integer greater than or equal to 2, and m is an integer greater than or equal to 5 and less than or equal to 20].

(21) The method of determining a genetic structure according to (20), wherein the method of comparing the “number of times of m types of codons whose frequency of appearance is high appearing in the coding region A which is assumed to be the coding region” and the “number of times of m types of codons whose frequency of appearance is low appearing in the coding region A” is a method which involves utilizing the “reciprocal of the sum of 1 and the ratio of the number of the latter to the number of the former” as a calculation formula, and which decides that said coding region A is a “false coding region” if the value of said reciprocal is less than a fixed value [herein m is an integer greater than or equal to 5 and less than or equal to 20].
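The verbal formula in (21) can be written out as a short worked equation. The symbols n_(high) and n_(low) are introduced here only for illustration and do not appear in the original: n_(high) stands for the number of times the m high-frequency codons appear in the coding region A, and n_(low) for the number of times the m low-frequency codons appear in it. For n_(high) greater than zero:

$\frac{1}{1 + n_{low}/n_{high}} = \frac{n_{high}}{n_{high} + n_{low}}$

Read this way, the reciprocal is the share of the selected codons in the coding region A that belong to the high-frequency group. It lies between 0 and 1, and under this reading the value Cd_(A) of formula (7) is twice this quantity.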

(22) The method of determining a genetic structure according to (20), which comprises the steps (s) to (u) described below:

(s) obtaining y_(i) from the following formula (6):

$y_{i} = \frac{\sum_{t=1}^{T} C_{i}^{t}}{\sum_{t=1}^{T} \sum_{j=1}^{64} C_{j}^{t}} \qquad (6)$

wherein the number of times of the i-th codon appearing in the t-th coding region is expressed as $C_{i}^{t}$;

(t) rearranging the 64 types of codon in descending order of y_(i), selecting “top m codons for which the value of y_(i) is large” and “bottom m codons for which the value of y_(i) is small, excluding the translation stop codon”, and obtaining the value of Cd_(A) for the coding region A which is assumed to be the coding region from the following formula (7):

$Cd_{A} = 2 \times \frac{\sum_{i=1}^{m} C_{i}^{A}}{\sum_{i=1}^{m} C_{i}^{A} + \sum_{i=62-m}^{61} C_{i}^{A}} \qquad (7)$

[herein the value of Cd_(A) is defined as 1 if $\sum_{i=1}^{m} C_{i}^{A} + \sum_{i=62-m}^{61} C_{i}^{A}$ is zero];

(u) deciding that said coding region A is a true coding region if the value of Cd_(A) for said coding region A which is calculated in the step (t) is greater than or equal to a threshold value CV, and deciding that it is a false coding region if said value of Cd_(A) is less than the threshold value CV [herein T is an integer greater than or equal to 2; i is a positive integer less than or equal to 64; j is a positive integer less than or equal to 64; t is a positive integer less than or equal to T; m is an integer from 5 to 20; and CV is a value from 0.8 to 1.8].
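The steps (s) to (u) can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the implementation of the invention: the names `split_codons`, `rank_codons`, and `coding_potential` are hypothetical, the stop codons are taken to be TAA, TAG, and TGA, ties in y_(i) are broken alphabetically, and the sequences in the usage note are invented toy data.

```python
from collections import Counter
from itertools import product

STOP_CODONS = {"TAA", "TAG", "TGA"}  # assumption: standard bacterial stop codons
SENSE_CODONS = [c for c in ("".join(p) for p in product("ACGT", repeat=3))
                if c not in STOP_CODONS]  # the 61 codons ranked in formula (7)


def split_codons(seq):
    """Cut an in-frame sequence into whole codons."""
    seq = seq.upper()
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]


def rank_codons(known_regions):
    """Step (s): order the sense codons by y_i of formula (6), largest first."""
    counts = Counter()
    for region in known_regions:
        counts.update(split_codons(region))
    total = sum(counts.values()) or 1  # guard against empty input
    # Ties are broken alphabetically; the source does not specify a rule.
    return sorted(SENSE_CODONS, key=lambda c: (-counts[c] / total, c))


def coding_potential(region_a, ranked, m):
    """Step (t): Cd_A of formula (7), defined as 1 when the denominator is zero."""
    counts = Counter(split_codons(region_a))
    high = sum(counts[c] for c in ranked[:m])    # ranks 1 .. m
    low = sum(counts[c] for c in ranked[-m:])    # ranks 62-m .. 61
    if high + low == 0:
        return 1.0
    return 2.0 * high / (high + low)
```

With a threshold CV chosen from 0.8 to 1.8, step (u) reduces to the comparison `coding_potential(region_a, ranked, m) >= cv`. For instance, a region whose selected codons are three high-frequency ones and one low-frequency one yields Cd_(A) = 1.5.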

(23) A method of determining a genetic structure, which comprises the steps (v) and (w) described below if a coding region of the prokaryote or a coding region A which is assumed to be a coding region overlaps with a coding region B which is assumed to be a coding region and present upon the complementary strand, and if said coding region B is included in said coding region A:

(v) comparing the length L_(B) (in base pairs) of said coding region B with the length L_(A) (in base pairs) of said coding region A, and deciding that said coding region B is a “false coding region” if L_(B) is less than or equal to T_(P) % of L_(A);

(w) deciding on the truth or falsity of said coding region A and of said coding region B by the method of determining a genetic structure according to any one of (18) to (22) if L_(B) exceeds T_(P) % of L_(A) [herein T_(P) is a positive integer from 30 to 95].
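The length comparison in steps (v) and (w) is a single percentage test, sketched below in Python. The function name `judge_included_region`, the return labels, and the default T_(P) of 50 are assumptions made for this example; only the range 30 to 95 comes from the text.

```python
def judge_included_region(length_a: int, length_b: int, tp_percent: int = 50) -> str:
    """Decide how to treat region B, which lies inside region A on the complementary strand.

    Returns "B_false" when step (v) rejects B outright, and "score_both" when
    step (w) hands both regions to a scoring method such as (18) to (22).
    """
    if not 30 <= tp_percent <= 95:
        raise ValueError("T_P is specified as an integer from 30 to 95")
    # Integer arithmetic avoids rounding: L_B <= T_P % of L_A
    if length_b * 100 <= tp_percent * length_a:
        return "B_false"
    return "score_both"
```

With T_(P) = 50, a 400 bp region inside a 1000 bp region is rejected directly, while a 600 bp region is passed on for scoring together with region A.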

(24) A method of determining a genetic structure, characterized by removing the translation stop codons from the coding regions which form a transcription unit, and linking up the resulting coding regions into a single coding region, before utilizing the method of determining a genetic structure according to any one of (18) to (23).

Furthermore, the present invention is specified by (25) to (45) below.

(25) A program for executing the following steps on a computer:

(a) finding a translation stop codon in the nucleotide sequence of a prokaryote from the information of said nucleotide sequence inputted via an input device, searching for a provisional translation start codon which yields the longest open reading frame (ORF) for all the obtained translation stop codons to make a candidate for ORF which is the combination of said translation stop codon and provisional translation start codon, and storing the position of these codons in said nucleotide sequence in a memory;

(b) calling up from the memory two adjacent candidates for ORF which are present upon the same strand, investigating the positions of the provisional translation start codon of the downstream side ORF (termed ORF-A) and of the translation stop codon of the upstream side ORF (termed ORF-B) and the distance between the ORF-A and the ORF-B; and

deciding that the two adjacent ORFs have a possibility to form a single transcription unit if the provisional translation start codon of the ORF-A is upstream of the translation stop codon of the ORF-B, or is within D_(S) bases downstream of said translation stop codon [herein D_(S) is an integer from 20 to 100], and proceeding to the step (c); or

deciding that the two adjacent ORFs do not form a single transcription unit if the distance between the positions of the provisional translation start codon of the ORF-A and of the translation stop codon of the ORF-B does not satisfy the above described condition, and proceeding to the step (f);

(c) calling up the above described nucleotide sequence data for the two ORFs which are decided to have a possibility to form a single transcription unit in the step (b), and searching for a candidate for the translation start codon of the ORF-A from a region (hereinafter termed the “vicinity of the translation stop codon”) between D_(B) bases downstream from the first T (thymidine) residue of the translation stop codon of the ORF-B and U_(B) bases upstream from said T residue [here D_(B) is an integer between 10 and 20, and U_(B) is an integer between 3 and 15]; and

determining that the ORF-A whose translation start codon is said candidate is a true coding region if there is a single candidate for the translation start codon, determining that said ORF-A and ORF-B form a single transcription unit, and writing the results of this determination into the memory; or

selecting the candidate whose priority is the highest if there is a plurality of candidates for the translation start codon, wherein the distance between each candidate and the translation stop codon of the ORF-B is used as an indicator of priority, determining that the ORF-A whose translation start codon is said candidate is a true coding region, and determining that said ORF-A and ORF-B constitute a single transcription unit, and writing the results of the determination into the memory;

(d) calling up the above described nucleotide sequence data if the translation start codon of the ORF-A cannot be determined in the step (c), examining whether a candidate for the translation start codon of the ORF-A is present within a region (hereinafter termed the “coding region around the vicinity of the translation stop codon”) between R_(D) bases downstream from the first T residue of the translation stop codon of the ORF-B and R_(U) bases upstream from said T residue [here R_(D) is an integer from 30 to 120, and R_(U) is an integer from 20 to 120] and excluding the “vicinity of the translation stop codon”; and

proceeding to the step (e) if a candidate for the translation start codon of the ORF-A is present in said region, or proceeding to the step (f) if no such candidate is present;

(e) calling up the above described nucleotide sequence data for a candidate for the translation start codon of the ORF-A found in the step (d), examining whether a ribosome binding site is present from 1 to 30 bases upstream of each candidate, determining its ribosome binding sequence if such a ribosome binding site is present, determining that the ORF-A, whose translation start codon is the candidate which corresponds to said ribosome binding sequence, is a true coding region, determining that said ORF-A and ORF-B form a single transcription unit, and writing the results of the determination into the memory;

(f) calling up the above described nucleotide sequence data for an ORF-A which is not decided to form a single transcription unit in the step (b) or for an ORF-A whose translation start codon cannot be determined in the step (e), searching for up to the number N of candidates [here N is an integer from 5 to 20] for the translation start codon, including the provisional start codon which yields the longest ORF, from the 5′ terminal, examining whether a ribosome binding site is present from 1 to 30 bases upstream of each candidate, determining its ribosome binding sequence if such a ribosome binding site is present, determining that the ORF-A whose translation start codon is the candidate corresponding to said ribosome binding sequence is a true coding region, and writing the results of the determination into the memory;

(g) repeating the above steps until all of the ORFs stored in the memory are processed;

outputting, via an output device, the results of determination of transcription units and coding regions in step (c), (e) or (f), which have been stored in the memory.
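The first two steps of (25) can be illustrated with a short Python sketch: step (a) collects, for each in-frame stop codon, the most upstream start codon that follows the previous in-frame stop codon, and step (b) applies the D_(S) distance test to two adjacent candidates. The sketch rests on several assumptions that are not in the original: the names `longest_orf_candidates` and `may_form_transcription_unit` are hypothetical, only the forward strand is scanned, start codons are taken to be ATG, GTG, and TTG, stop codons TAA, TAG, and TGA, positions are 0-based indices of the first base of a codon, and the default D_(S) of 50 is one value from the range 20 to 100.

```python
START_CODONS = {"ATG", "GTG", "TTG"}  # assumption: usual prokaryotic start codons
STOP_CODONS = {"TAA", "TAG", "TGA"}   # assumption: standard bacterial stop codons


def longest_orf_candidates(seq):
    """Step (a), forward strand only: return (start_pos, stop_pos) pairs.

    For every stop codon, the provisional start codon is the first start
    codon after the previous stop codon in the same reading frame.
    """
    seq = seq.upper()
    candidates = []
    for frame in range(3):
        first_start = None
        for pos in range(frame, len(seq) - 2, 3):
            codon = seq[pos:pos + 3]
            if codon in STOP_CODONS:
                if first_start is not None:
                    candidates.append((first_start, pos))
                first_start = None          # a new ORF may begin after this stop
            elif codon in START_CODONS and first_start is None:
                first_start = pos
    return sorted(candidates)


def may_form_transcription_unit(orf_b, orf_a, d_s=50):
    """Step (b): ORF-B is upstream, ORF-A is downstream on the same strand.

    True when the provisional start of ORF-A lies upstream of (or inside) the
    stop codon of ORF-B, or at most d_s bases downstream of it.
    """
    stop_b = orf_b[1]
    start_a = orf_a[0]
    gap = start_a - (stop_b + 3)            # bases between stop codon and start codon
    return gap <= d_s
```

Pairs for which `may_form_transcription_unit` returns `True` would go on to step (c); the others would go to step (f).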

(26) The program according to (25), wherein the above described step (e) is:

calling up the above described nucleotide sequence data;

calculating a “ribosome binding score” which expresses the paired state between an mRNA sequence of 4 to 17 bases upstream of a candidate for the translation start codon of the ORF-A and a sequence (3′-UUCCUCC-5′) involved in the binding to mRNA in a 16S rRNA 3′ terminal sequence, or between an mRNA sequence of 4 to 16 bases upstream of said candidate and a sequence (3′-UCCUCC-5′) involved in the binding to mRNA within a 16S rRNA 3′ terminal sequence as a numerical value according to the four rules described below:

-   (1) A pairing of G and C yields +4;
-   (2) A pairing of A and U yields +2;
-   (3) A pairing of G and U yields +1;
-   (4) When no pairing is present at a base pair which is adjacent to a base pair where a pairing is present, then this yields −1;

maintaining a threshold value V₃ [herein V₃ is an integer from 4 to 12] for said ribosome binding score, determining that the above described mRNA sequence whose ribosome binding score exceeds the threshold value V₃ is a ribosome binding sequence, and selecting the translation start codon which corresponds to said ribosome binding sequence as the translation start codon of the ORF-A;

dividing the “region around the vicinity of the translation stop codon of the ORF-B” into two regions, “the region downstream of said vicinity” and “the region upstream of said vicinity”, if there is a plurality of said translation start codons for the ORF-A, and selecting the candidate whose priority is highest, wherein the order of priority is first “the region downstream of said vicinity” and second “the region upstream of said vicinity”;

selecting the candidate whose priority is highest if a plurality of translation start codons is present within the respective regions, wherein the distance from the translation stop codon of the ORF-B is used as an indicator of priority; and

determining that the ORF-A whose translation start codon is the selected candidate is a true coding region, determining that said ORF-A and ORF-B form a single transcription unit, and writing the results of the determination into the memory.
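The four scoring rules can be illustrated with a small Python sketch. It is one possible reading, offered under explicit assumptions: the mRNA window and the rRNA sequence are aligned base by base without gaps, rule (4) is applied as a penalty of 1 for each unpaired position that has at least one paired neighbour, and the names `ribosome_binding_score` and `best_window_score` are hypothetical. The rRNA sequence is written 3′ to 5′ so that it lines up with the mRNA read 5′ to 3′.

```python
ANTI_SD_3TO5 = "UUCCUCC"  # 16S rRNA 3' terminal sequence from the text, written 3'->5'

PAIR_SCORES = {
    ("G", "C"): 4, ("C", "G"): 4,   # rule (1)
    ("A", "U"): 2, ("U", "A"): 2,   # rule (2)
    ("G", "U"): 1, ("U", "G"): 1,   # rule (3)
}


def ribosome_binding_score(mrna_window, anti_sd=ANTI_SD_3TO5):
    """Score one ungapped alignment of an mRNA window (5'->3') against the rRNA (3'->5')."""
    mrna = mrna_window.upper().replace("T", "U")   # accept DNA letters as well
    if len(mrna) != len(anti_sd):
        raise ValueError("window and rRNA sequence must have equal length")
    gains = [PAIR_SCORES.get(pair, 0) for pair in zip(mrna, anti_sd)]
    paired = [gain > 0 for gain in gains]
    score = sum(gains)
    for i, is_paired in enumerate(paired):
        # rule (4), as interpreted here: unpaired position next to a paired one
        if not is_paired and any(paired[max(i - 1, 0):i] + paired[i + 1:i + 2]):
            score -= 1
    return score


def best_window_score(upstream, anti_sd=ANTI_SD_3TO5):
    """Slide the rRNA sequence along an upstream region and keep the best score."""
    width = len(anti_sd)
    windows = [upstream[i:i + width] for i in range(len(upstream) - width + 1)]
    return max((ribosome_binding_score(w, anti_sd) for w in windows), default=0)
```

Under these assumptions the perfectly complementary window `AAGGAGG` scores 22, and a single mismatch in the middle (`AAGCAGG`) loses the +4 of the missing G and C pairing plus the −1 of rule (4), giving 17. A candidate would then be accepted when its score exceeds the threshold V₃.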

(27) The program according to (25) or (26), wherein the above described step (f) is:

calling up the above described nucleotide sequence data;

calculating a “ribosome binding score” which expresses the paired state between an mRNA sequence of 4 to 17 bases upstream of a candidate for the translation start codon of the ORF-A and a sequence (3′-UUCCUCC-5′) involved in the binding to mRNA in a 16S rRNA 3′ terminal sequence, or between an mRNA sequence of 4 to 16 bases upstream of said candidate and a sequence (3′-UCCUCC-5′) involved in the binding to mRNA in a 16S rRNA 3′ terminal sequence as a numerical value, according to the four rules described below:

-   (1) A pairing of G and C yields +4;
-   (2) A pairing of A and U yields +2;
-   (3) A pairing of G and U yields +1;
-   (4) When no pairing is present at a base pair which is adjacent to a base pair where a pairing is present, then this yields −1;

maintaining a threshold value V₁ for said ribosome binding score, determining that the above described mRNA sequence which exceeds the threshold value V₁ is the ribosome binding sequence, and selecting a candidate for the translation start codon which corresponds to said ribosome binding sequence as the translation start codon of the ORF-A;

selecting the translation start codon corresponding to the ribosome binding sequence which yields the highest score as the translation start codon of the ORF-A if there is a plurality of said translation start codons;

setting, in a stepwise manner, one or more threshold values which are smaller than V₁ and which include the threshold value V₃ if there is no candidate which exceeds the threshold value V₁, searching for the above described mRNA sequence whose score exceeds said threshold value in a stepwise manner, determining the ribosome binding sequence, and selecting the translation start codon which corresponds to said ribosome binding sequence as the translation start codon of the ORF-A; and

determining that the ORF-A whose translation start codon is the selected candidate is a true coding region, and writing the results of the determination into the memory [herein V₁ is an integer which is greater than the V₃ of (2), and which is between 7 and 14].

(28) The program according to (26) or (27), characterized in that the above described “ribosome binding score” is calculated by deducting a numerical value P_(G) if the translation start codon is GTG, and by deducting a numerical value P_(T) if the translation start codon is TTG [herein, P_(G) is an integer from 1 to 4, and P_(T) is an integer from 2 to 6].
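The start codon deduction of (28) amounts to a small adjustment applied to a score that has already been computed. In the Python sketch below, the function name `penalized_score` and the default values P_(G) = 2 and P_(T) = 4 are assumptions chosen from inside the stated ranges.

```python
def penalized_score(score: int, start_codon: str, p_g: int = 2, p_t: int = 4) -> int:
    """Deduct P_G for a GTG start codon and P_T for a TTG start codon."""
    codon = start_codon.upper().replace("U", "T")  # accept RNA letters as well
    if codon == "GTG":
        return score - p_g
    if codon == "TTG":
        return score - p_t
    return score                                   # other start codons are not penalized
```

The effect is that, for equal ribosome binding scores, a candidate starting with GTG or TTG needs a stronger ribosome binding sequence to pass the same threshold.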

(29) A program for executing the following steps on a computer: calling up the data for transcription units and coding regions stored in the memory after the above described step (g) in the program according to (25) to (28);

(h) deciding that the transcription unit Q or the coding region B is a “false transcription unit” or a “false coding region” if a transcription unit Q or a coding region B which is present upon the same strand as a transcription unit P or a coding region A is included in the transcription unit P or the coding region A;

(i) deciding that the transcription unit Q or the coding region B is a “false transcription unit” or a “false coding region” if a transcription unit Q or a coding region B which is present upon the complementary strand to a transcription unit P or a coding region A is included in the transcription unit P or the coding region A;

(j) deciding that the transcription unit or coding region whose length is shorter is a “false transcription unit” or a “false coding region” if a transcription unit P or a coding region A overlaps with a transcription unit Q or a coding region B which is present upon the complementary strand; and

outputting the results of the above described decision via an output device.
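The inclusion and overlap rules (h) to (j) can be sketched as interval comparisons. The Python example below is a simplified illustration under stated assumptions: the `Region` class, its field names, and the function `false_regions` are hypothetical, coordinates are inclusive forward-strand positions, inclusion is taken to require a strictly shorter inner region, regions of equal length that overlap are left undecided, and all pairs are compared directly without any attempt at efficiency.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    name: str
    start: int    # inclusive, forward-strand coordinate
    end: int      # inclusive
    strand: str   # "+" or "-"

    @property
    def length(self) -> int:
        return self.end - self.start + 1


def contains(outer: Region, inner: Region) -> bool:
    """True when inner lies within outer and is strictly shorter."""
    return (outer.start <= inner.start and inner.end <= outer.end
            and inner.length < outer.length)


def overlaps(a: Region, b: Region) -> bool:
    return a.start <= b.end and b.start <= a.end


def false_regions(regions):
    """Return the names of regions judged false by rules (h), (i), and (j)."""
    rejected = set()
    for a in regions:
        for b in regions:
            if a is b:
                continue
            if contains(a, b):
                rejected.add(b.name)          # (h) same strand, (i) complementary strand
            elif (a.strand != b.strand and overlaps(a, b)
                  and b.length < a.length):
                rejected.add(b.name)          # (j) the shorter of two overlapping regions
    return rejected
```

In a toy layout where A spans 1 to 900 on the plus strand, regions lying inside A on either strand, and a shorter region on the minus strand that partially overlaps A, are all reported as false, while a distant region is left alone.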

(30) A program for executing the following steps on a computer:

(k) investigating the type of the codons and the number thereof, which are utilized in a plurality (T) of the coding regions of the prokaryote which regions are determined and inputted via an input means, selecting k types of combination of codons among them wherein “the frequency of appearance of one codon is high, and the frequency of appearance of a codon which has the complementary sequence of the 3-base sequence of said codon is low”, and storing the codons in the memory;

(l) measuring the frequency of appearance of the selected codons in a coding region A which is assumed to be the coding region from the data of said coding region A inputted via an input means, comparing the “number of times of the k types of codons whose frequency of appearance is high appearing in a coding region A which is assumed to be a coding region” with the “number of times of the k types of codons whose frequency of appearance is low appearing in said coding region A”, and deciding on the truth or falsity of said coding region A [herein k is an integer greater than or equal to 5 and less than or equal to 20]; and

displaying the results of the above described decision via an output device.

(31) The program according to (30), wherein the step (l) is comparing the “number of times of the k types of codons whose frequency of appearance is high appearing in a coding region A which is assumed to be the coding region” and the “number of times of the k types of codons whose frequency of appearance is low appearing in said coding region A” by using “the reciprocal of the sum of 1 and the ratio of the number of the latter to the number of the former” as a calculation formula, and deciding that said coding region A is a “false coding region” if the value of said reciprocal is less than a fixed value.

(32) The program for executing the following steps on a computer according to (30):

(m) constructing a codon table by arranging the 64 types of codons so that the 3-base sequence of the i-th codon has the complementary sequence to the nucleotide sequence of the (i+32)-th codon, and storing the codon table in the memory;

(n) inputting the nucleotide sequence of the number T of determined coding regions of a prokaryote, and obtaining y_(i) from the formula (2) below and y_(i+32) from the formula (3) below:

$y_{i} = \frac{\sum_{t=1}^{T} C_{i}^{t} - \sum_{t=1}^{T} C_{i+32}^{t}}{\sum_{t=1}^{T} \sum_{j=1}^{64} C_{j}^{t}} \qquad (2)$

$y_{i+32} = \frac{\sum_{t=1}^{T} C_{i+32}^{t} - \sum_{t=1}^{T} C_{i}^{t}}{\sum_{t=1}^{T} \sum_{j=1}^{64} C_{j}^{t}} \qquad (3)$

wherein the number of times the i-th codon appears in the t-th coding region is expressed as $C_{i}^{t}$;

(o) calling up the codon table which was obtained in the step (m) from the memory, setting up a correspondence between the y_(i) and y_(i+32) for the codons in the table, rearranging the sequence of the codons in the table in descending order of the y_(i) and the y_(i+32), selecting top k codons for which the value of y_(i) or of y_(i+32) is large, and obtaining the value of Sd_(A) for a coding region A by the following formula (4):

$Sd_{A} = 2 \times \frac{\sum_{i=1}^{k} C_{i}^{A}}{\sum_{i=1}^{k} C_{i}^{A} + \sum_{i=65-k}^{64} C_{i}^{A}} \qquad (4)$

[herein the value of Sd_(A) is defined as 1 if $\sum_{i=1}^{k} C_{i}^{A} + \sum_{i=65-k}^{64} C_{i}^{A}$ is zero];

(p) deciding that said coding region is a true coding region if the value of Sd_(A) of a coding region A obtained in the above described step is greater than or equal to a threshold value S₁, and deciding that it is a false coding region if said value of Sd_(A) is less than the threshold value S₁ [herein T is an integer greater than or equal to 2, i is a positive integer less than or equal to 32, j is a positive integer less than or equal to 64, t is a positive integer less than or equal to T, k is an integer from 5 to 20, and S₁ is a value from 0.8 to 1.8].
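The steps (m) to (p) can be sketched in Python as follows. This is an illustrative reading, not the implementation of the invention. The pairing of codons uses the reverse complement, following the wording of the abstract; the names `reverse_complement`, `select_shadow_pairs`, and `shadow_score` are hypothetical; ties in y are broken alphabetically; and instead of building the 64-row table explicitly, the sketch takes the partner of each of the top k codons as the low-frequency set, which coincides with the bottom k rows of the sorted table when there are no ties, because formulas (2) and (3) give partner codons values of opposite sign.

```python
from collections import Counter
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")
ALL_CODONS = ["".join(p) for p in product("ACGT", repeat=3)]


def reverse_complement(codon):
    return codon.translate(COMPLEMENT)[::-1]


def split_codons(seq):
    seq = seq.upper()
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]


def select_shadow_pairs(known_regions, k):
    """Steps (m)-(o): pick k (high codon, partner codon) pairs with the largest y."""
    counts = Counter()
    for region in known_regions:
        counts.update(split_codons(region))
    total = sum(counts[c] for c in ALL_CODONS) or 1
    # formulas (2) and (3): frequency difference between a codon and its partner
    y = {c: (counts[c] - counts[reverse_complement(c)]) / total for c in ALL_CODONS}
    top = sorted(ALL_CODONS, key=lambda c: (-y[c], c))[:k]
    return [(c, reverse_complement(c)) for c in top]


def shadow_score(region_a, pairs):
    """Formula (4): Sd_A, defined as 1 when none of the selected codons occurs."""
    counts = Counter(split_codons(region_a))
    high = sum(counts[h] for h, _ in pairs)
    low = sum(counts[l] for _, l in pairs)
    if high + low == 0:
        return 1.0
    return 2.0 * high / (high + low)
```

A region read on its true strand tends to score near 2, while the same stretch read on the complementary strand (the "shadow") tends to score near 0, which is what the threshold S₁ in step (p) separates.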

(33) A program for executing the following steps on a computer:

examining whether there is mutual overlapping and inclusion between coding regions of a prokaryote which are inputted via an input device:

(q) calling up the above described nucleotide sequence data if a coding region or a coding region A which is assumed to be a coding region overlaps with a coding region B which is assumed to be a coding region and present upon the complementary strand, and if said coding region B is included in said coding region A, comparing the length L_(B) (in base pairs) of said coding region B with the length L_(A) (in base pairs) of said coding region A, and deciding that said coding region B is a “false coding region” if L_(B) is less than or equal to T_(P) % of L_(A);

(r) deciding on the truth or falsity of said coding region A and of said coding region B by the steps of the program according to any one of (30) to (32) if L_(B) exceeds T_(P) % of L_(A) [herein T_(P) is a positive integer from 30 to 95].

(34) The program according to (33), characterized by rewriting the data for the determined coding regions to a single coding region constructed by removing the translation stop codons from the coding regions which form a transcription unit and by linking up the resulting coding regions from said data, before executing the steps (k) and (l) described above.

(35) A program for deciding on the truth or falsity of a coding region or of a transcription unit which is determined and stored in the memory in any one of (25) to (35), by the steps of the program according to any one of (30) to (34).

(36) A program for executing the steps:

calling up the data for coding regions which are determined as true coding regions by the steps of the program according to any one of (25) to (35) from the memory, calculating the length of the polypeptide encoded by each coding region, and deciding on the truth or falsity of the coding regions which encode the polypeptides of L_(M) amino acids or more in length, by using the program according to any one of (7) to (12), based upon the nucleotide sequences of the coding regions encoding the polypeptide of L_(F) amino acids or more in length [herein L_(F) is a positive integer greater than or equal to 100, and L_(M) is a positive integer greater than or equal to 20].

(37) A program for executing the following steps on a computer:

calculating the content of the first and third G residues and C residues of the codons in a coding region of a prokaryote whose GC content exceeds 50% by using a predetermined calculation formula from the data for said coding region inputted via an input device; deciding that said coding region is a “false coding region” if the calculated content is less than a fixed value; and outputting the results of the decision via an output device.

(38) The program according to (37), wherein the following formula (5) is used as a calculation formula, the value of GC_(i) described below is used as a calculated content, and one value which is selected from 0.6 to 0.75 is used as a fixed value:

$GC_{i} = \frac{y^{i(1)} + y^{i(3)}}{\sum_{r=1}^{3} y^{i(r)}}, \quad \text{wherein} \quad y^{i(r)} = \sum_{n=1}^{N_{i}} \sum_{b=1}^{4} x_{n(b)}^{i(r)} \qquad (5)$

[herein when the r-th base (r is 1, 2, or 3) of the n-th codon of the i-th coding region is b (b is 1, 2, 3, or 4), then $x_{n(b)}^{i(r)} = \begin{cases} 1 & (b = 1 \text{ or } 2) \\ 0 & (b = 3 \text{ or } 4) \end{cases}$

and, as for b, when the r-th base of the n-th codon of the i-th coding region is G, C, A, or T, then b is 1, 2, 3, or 4, respectively, i and n are positive integers, and N_(i) denotes the total number of the codons (excluding the translation stop codon) of the i-th coding region].

(39) A program for executing the following steps on a computer:

calculating the content of the first and third G residues and C residues of the codons of the 5′ terminal region of a coding region of a prokaryote whose GC content exceeds 50% by using a predetermined calculation formula, from the data for said coding region inputted via an input device;

deciding that the translation start codon of said coding region is a “false translation start codon” if the calculated content is less than a fixed value, and outputting the results of this decision via an output device;

calling up the nucleotide sequence data of the above described coding region which is inputted via an input device, and re-searching for a translation start codon which is present downstream of said translation start codon decided to be false.

(40) The program according to (39), wherein the following formula (5) is used as a calculation formula, the value of GC_(i) described below is used as a calculated content, and one value selected from 0.6 to 0.75 is used as a fixed value:

$GC_{i} = \frac{y^{i(1)} + y^{i(3)}}{\sum_{r=1}^{3} y^{i(r)}}, \quad \text{wherein} \quad y^{i(r)} = \sum_{n=1}^{N_{i}} \sum_{b=1}^{4} x_{n(b)}^{i(r)} \qquad (5)$

[herein, when the r-th base (r is 1, 2, or 3) of the n-th codon of the i-th coding region is b (b is 1, 2, 3, or 4), then $x_{n(b)}^{i(r)} = \begin{cases} 1 & (b = 1 \text{ or } 2) \\ 0 & (b = 3 \text{ or } 4) \end{cases}$

and, as for b, when the r-th base of the n-th codon of the i-th coding region is G, C, A, or T, b is 1, 2, 3, or 4, respectively, i and n are positive integers, and N_(i) denotes the total number of the codons (excluding the translation stop codon) of the i-th coding region].

(41) A program for executing the following steps on a computer:

selecting m types of codons whose frequency of appearance is high and m types of codons whose frequency of appearance is low in the number T of the coding regions of a prokaryote which are determined by the steps of the program according to any one of (25) to (40), and storing the codons in the memory;

measuring the “number of times of the m types of codons whose frequency of appearance in the number T of the coding regions is high appearing in the coding region A which is assumed to be the coding region” and the “number of times of the m types of codons whose frequency of appearance in the T coding regions is low appearing in said coding region A” from the data for coding regions which are not determined, different from the number T of the coding regions and inputted; deciding on the truth or the falsity of said coding region A by comparing both numbers; and outputting the results of the decision via an output device [herein T is an integer greater than or equal to 2, and m is an integer greater than or equal to 5 and less than or equal to 20].

(42) The program according to (41), wherein the method of comparing the “number of times of the m types of codons whose frequency of appearance is high appearing in the coding region A which is assumed to be the coding region” with the “number of times of the m types of codons whose frequency of appearance is low appearing in said coding region A” is the method which utilizes the “reciprocal of the sum of 1 and the ratio of the number of the latter to the number of the former” as a calculation formula, and which decides that said coding region A is a “false coding region” if the value of said reciprocal is less than a fixed value [herein m is an integer greater than or equal to 5 and less than or equal to 20].

(43) The program for executing the following steps on a computer according to (41):

(m) constructing a codon table in which the 64 types of codons are arranged so that the 3-base sequence of the i-th codon has a complementary sequence to the nucleotide sequence of the (i+32)-th codon, and storing the codon table in the memory;

(s) obtaining y_(i) by the following formula (6):

$y_{i} = \frac{\sum_{t=1}^{T} C_{i}^{t}}{\sum_{t=1}^{T} \sum_{j=1}^{64} C_{j}^{t}} \qquad (6)$

wherein the number of times of the i-th codon appearing in the t-th coding region is expressed as $C_{i}^{t}$;

(t) calling up the codon table from the memory, rearranging the 64 types of codon in descending order of y_(i), selecting the “top m codons for which the value of y_(i) is large” and the “bottom m codons for which the value of y_(i) is small, excluding the translation stop codons”, and obtaining the value of Cd_(A) for the coding region A which is assumed to be the coding region from the following formula (7): $\begin{matrix}{{Cd}_{A} = 2 \times \frac{\sum\limits_{i = 1}^{m} C_{i}^{A}}{\sum\limits_{i = 1}^{m} C_{i}^{A} + \sum\limits_{i = 62 - m}^{61} C_{i}^{A}}} & (7)\end{matrix}$

[herein the value of Cd_(A) is defined as 1 if $\sum\limits_{i = 1}^{m} C_{i}^{A} + \sum\limits_{i = 62 - m}^{61} C_{i}^{A}$ is zero]; and

(u) deciding said coding region A to be a true coding region if the value of Cd_(A) for said coding region A which is calculated by the step (t) is greater than or equal to a threshold value CV, deciding said coding region A to be a false coding region if said value of Cd_(A) is less than the threshold value CV, and outputting the decision results via an output device [herein T is an integer greater than or equal to 2; i is a positive integer less than or equal to 64; j is a positive integer less than or equal to 64; t is a positive integer less than or equal to T; m is an integer from 5 to 20; and CV is a value from 0.8 to 1.8].

(44) A program for executing the following steps on a computer, wherein a coding region of the prokaryote or a coding region A which is assumed to be a coding region overlaps with a coding region B which is assumed to be a coding region and present upon the complementary strand, and said coding region B is included in said coding region A:

(v) comparing the length L_(B) (in base pairs) of the coding region B with the length L_(A) (in base pairs) of the coding region A, and deciding that the coding region B is a “false coding region” if L_(B) is less than or equal to T_(P) % of L_(A);

(w) deciding on the truth or falsity of said coding region A and of said coding region B by the method of determining a genetic structure according to any one of (41) to (43) if L_(B) exceeds T_(P) % of L_(A) [herein T_(P) is a positive integer from 30 to 95]; and

outputting the results of the decision via an output device.

(45) A program executing the following steps:

removing translation stop codons from the coding regions which form a transcription unit from the data for determined coding regions; linking up the resulting coding regions into a single coding region; rewriting the data for determined coding regions to the resulting single coding region; and executing the steps of the program according to any one of (41) to (44).

(46) A computer-readable recording medium on which the program according to any one of claim 25 to claim 45 is recorded.

(47) A system for determining a genetic structure which comprises:

(i) an input means for inputting nucleotide sequence data;

(ii) a means for executing the program according to any one of claim 25 to claim 45, using the inputted data; and

(iii) an output device for outputting the results which are obtained by (ii).
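The codon-frequency test of items (41) to (43) can be illustrated by the following minimal Python sketch. It is not the claimed program: the function names, the default m = 10 and CV = 1.0, and the tie-breaking between codons of equal frequency are assumptions made for illustration, and the complementary codon table of step (m) is not constructed. Stop codons are excluded before ranking, which is one reading of the index range 62−m to 61 in formula (7).

```python
from collections import Counter
from itertools import product

STOP_CODONS = {"TAA", "TAG", "TGA"}
ALL_CODONS = ["".join(p) for p in product("ACGT", repeat=3)]  # 64 codons


def codon_counts(cds):
    """Count the in-frame codons of a coding sequence given 5'->3' as DNA."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    return Counter(cds[i:i + 3] for i in range(0, usable, 3))


def codon_frequencies(known_cdss):
    """Formula (6): pooled frequency y_i of each codon over the T known regions."""
    total = Counter()
    for cds in known_cdss:
        total.update(codon_counts(cds))
    grand = sum(total[c] for c in ALL_CODONS)
    return {c: total[c] / grand for c in ALL_CODONS}


def shadow_score(cds_a, freqs, m=10):
    """Formula (7): Cd_A = 2 * high / (high + low), or 1 if high + low is zero."""
    # Assumption: stop codons are dropped, leaving 61 ranked codons.
    ranked = sorted((c for c in ALL_CODONS if c not in STOP_CODONS),
                    key=lambda c: freqs[c], reverse=True)
    top, bottom = ranked[:m], ranked[-m:]
    counts = codon_counts(cds_a)
    high = sum(counts[c] for c in top)
    low = sum(counts[c] for c in bottom)
    if high + low == 0:
        return 1.0
    return 2.0 * high / (high + low)


def is_true_coding_region(cds_a, freqs, m=10, cv=1.0):
    """Step (u): true if Cd_A >= CV (CV is taken from 0.8 to 1.8 in the text)."""
    return shadow_score(cds_a, freqs, m) >= cv
```

A candidate that uses only frequent codons scores 2, one that uses only rare codons scores 0, and one that uses neither group scores 1, so a threshold CV near 1 separates regions enriched in frequent codons from those enriched in rare ones.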

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a figure showing the genetic structure of a prokaryote. In the present specification, the distance between the AGGA sequences which appear at high frequency in the ribosome binding sequences in the drawing and the start codon is defined as the distance between the SD sequence and the start codon.

FIG. 2 is a figure showing the structure of the transcription units of a prokaryote. As shown in the figure, the fact that many false CDSs, i.e. so-called “gene shadows”, appear in the opposite strand of a true CDS is a cause of deterioration of the accuracy of CDS determination. The ORFs are shown by “white boxes” or by “white and gray colored” boxes, while the CDSs are shown by gray colored boxes. Furthermore, the SD sequences are shown by small black boxes, while the candidates for start codons (normally ATG, GTG, and TTG) are shown by black triangles, and the stop codons are shown by black dots.

FIG. 3 is a figure showing the flow chart of an algorithm of a “method of determining a genetic structure from the viewpoint of transcription unit structure” according to the present invention.

FIG. 4 is a figure showing, in the “method of determining a genetic structure from the viewpoint of transcription unit structure” according to the present invention, the value of a rank function which determines a priority ranking of the position of a start codon, when two coding regions constitute an operon. The value R_(up) in the figure is an integer which is greater than the distance between the stop codon for creating a polycistron for which a value of three times (R_(up)−2) has been set in advance and the start codon, and moreover is an integer greater than 10. R_(DN) is an integer which is not greater than a value (for example 30) which has been set in advance, and which is greater than (R_(up)+2).

FIG. 5 is a figure showing, as a flow chart, the process of computation of the “ribosome binding score” of a coding region A (CDS-A).

FIG. 6 is a figure showing a paired state between an SD sequence of an Escherichia coli trpL gene and a 16S rRNA 3′ terminal sequence.

FIG. 7 is a figure showing a flow chart of an algorithm of a variant of a “method of CDS determination aimed at transcription unit structure” of the present invention, and is continued in FIG. 8.

FIG. 8 continues from FIG. 7, and is a figure showing a flow chart of the algorithm of the variant of the “method of CDS determination aimed at transcription unit structure” of the present invention.

FIG. 9 is a figure showing as a flow chart a process for, after having determined a plurality of CDSs and transcription units from the nucleotide sequences of the plus strand and of the minus strand using the “method of CDS determination aimed at transcription unit structure” of the present invention, deciding upon the truth or the falsity of transcription units which overlap.

FIG. 10 is a figure showing as a flow chart an example of a method for determining a plurality of CDSs by repeatedly using the “method of CDS determination aimed at transcription unit structure”. In the figure, X₁ (%) normally takes a value from 5 to 20.

FIG. 11 is a figure showing as a flow chart a process for enhancing the accuracy of CDS determination by deciding upon the truth or the falsity of a coding region A (CDS-A) using a shadow discrimination function. In the figure, P₁ (%) normally takes a value from 5 to 20.

FIG. 12 is a structural figure of hardware for desirably implementing the method of the present invention.

FIG. 13 is an overall processing flow diagram of one embodiment of the method of determining a genetic structure of the present invention.

FIG. 14 is an overall processing flow diagram of one embodiment of the method of determining a genetic structure of the present invention.

FIG. 15 is an overall processing flow diagram of one embodiment of the method of determining a genetic structure of the present invention.

BEST MODE FOR CARRYING OUT THE INVENTION

Although the method of determining a genetic structure of the present invention is not limited to a method which involves utilizing a computer, it is desirable to carry out the method using a computer, because a computer can process a large amount of nucleotide sequence data at high speed.

FIG. 12 shows a hardware structure for operating the method of determining a genetic structure of the present invention in a desirable manner. Basically, it consists of an input device 1, a storage device 2, a central processing unit (CPU) 3, and an output device 4, and these are connected together by bus lines 5.

The CPU 3 is a device which executes a program to search for sequences, to calculate, compare, and store various indicators, to designate data to be outputted, and the like.

The input device 1 is a device for inputting the nucleotide sequence data which must be analyzed, for inputting designation commands for executing various types of processing, and the like.

The storage device 2 comprises a memory for storing inputted information and the results of calculation, and memory access means for accessing the memory; it mainly consists of an external storage means 6, a data storage means 7, a program file 8, and the like.

The external storage means 6 is a recording medium such as a floppy disk, a magneto-optical disk, a hard disk, a memory or the like, in which the data of the nucleotide sequence to be analyzed is stored.

The data storage means 7 (hereinafter also termed the memory) is a device for storing data which is obtained by various types of processing for operating the method of determining a genetic structure of the present invention. Data of ORFs, CDSs, and transcription units, which are determined or whose truth or falsity is decided by the method of determining a genetic structure of the present invention, is also stored therein.

In the program file 8, a program for executing the method of the present invention, a table for the rank function described hereinafter, and the like are stored.

The output device 4 is a device for outputting the results of determination or decision of ORFs, CDSs, and transcription units by the method of determining a genetic structure of the present invention, and utilizes a display, a printer, a recording medium or the like. In other words, the information about the positions of the translation start codon and the stop codon and about the CDSs which are included in each of the transcription units can be outputted by displaying it upon the display, by printing it with the printer, or by recording it upon the recording medium.

Furthermore, it is desirable that the hardware for operating the present invention is connected to a transmission network 9, so that an external nucleotide sequence database can be transferred via said network and stored in the external storage means.

1. A method of determining a genetic structure from the viewpoint of a transcription unit structure

The “method of determining a genetic structure from the viewpoint of a transcription unit structure” of the present invention is a method which employs a method for predicting the positions of start codons and the structure of polycistronic transcription units at high accuracy, and is a method of determining a genetic structure of a prokaryote at high accuracy. The structure of the polycistronic transcription units can be determined by the method. In the method of determining a genetic structure, the determination of the position of a start codon, the determination of a CDS, and the determination of the structure of a polycistronic transcription unit are intimately related to one another. In other words, when the position of the start codon of a coding region CDS-A is determined, it is examined whether CDS-A and another coding region CDS-B form a polycistronic transcription unit, and if CDS-A and CDS-B have a possibility to form a polycistronic transcription unit, then the positional relationship between the stop codon of CDS-B and the candidates for a start codon of CDS-A is examined to decide the truth or falsity of the start codon. Furthermore, by expressing the paired state of the mRNA sequence and the ribosome binding sequence within the 3′ terminal sequence of the 16S rRNA as a numerical value, the position of the start codon is determined and the truth or falsity of the CDS is decided.

FIG. 13 gives an overall processing flow diagram when carrying out the method of determining a genetic structure according to an embodiment of the present invention using the device structured as shown in FIG. 12.

In FIG. 13, the nucleotide sequence information of a prokaryote which is to be analyzed is inputted via the input device at the step S11.

In the method of the present invention, the nucleotide sequence data can be inputted to the computer by inputting the data directly from a keyboard, by reading the data which is transmitted via the internet or the like, or by reading the data recorded on a recording medium (an externally stored file) such as a floppy disk, a magneto-optical disk, a hard disk, a memory and the like.

In the step S12, the CPU searches for a stop codon from the nucleotide sequence data which is inputted in the step S11, searches for a provisional start codon which yields the longest ORF based upon said stop codon, and stores a set of the stop codon and the provisional start codon as a candidate for an ORF in the memory.

In the step S13, the CPU calls up from the memory two adjacent ORFs which are present upon the same strand, and decides on the possibility that they form a single transcription unit by investigating the positional relationship between the translation start codon on the downstream side and the translation stop codon on the upstream side.

In the step S14, for ORFs which are decided in the step S13 to have a possibility to form a single transcription unit, candidates for a translation start codon are searched for from the region around the translation start codon on the downstream side, the translation start codon on the downstream side is determined based upon the positional relationship with the translation stop codon on the upstream side and a priority order which has been determined in advance, and the determined transcription unit and CDS are stored in the memory. For other ORFs, candidates for a translation start codon are searched for from the region around the above described provisional translation start codon; from the found candidates, a candidate which has a ribosome binding region in an appropriate position upstream thereof is selected to determine a translation start codon, and the determined CDSs are stored in the memory.

In the step S17, the data for determined transcription units and CDSs, which is stored in the memory, is outputted via the output device.

During the output, the position of a start codon is expressed as the position of the first base of the 3 bases of the codon in the case of a start codon upon a plus strand, and as the position of the first base or the last base of the 3 bases of the codon in the case of a start codon upon a minus strand. The position of a stop codon is expressed as the position of the first base or the last base of the 3 bases of the codon in the case of a stop codon upon a plus strand, and as the position of the first base of the 3 bases of the codon in the case of a stop codon upon a minus strand.

The method of determining a genetic structure shown in FIG. 13 is described in further detail using the flow chart of FIG. 3.

The process (a) of FIG. 3 corresponds to the step S12 of FIG. 13.

The information about a nucleotide sequence which can be utilized in the present invention is information about the nucleotide sequence of the DNA or the RNA of a prokaryote. It is possible to utilize a nucleotide sequence of either a single strand or double strands. In the case of double strands, information about the nucleotide sequence of either of the strands is utilized. It is possible that the nucleotide sequence information of the plus strand of the double strands is utilized, and then the nucleotide sequence information of the minus strand is utilized. When a nucleotide sequence of RNA is utilized, it is desirable to utilize it by replacing the U residues by T residues. It is possible to predict the CDSs, not only for a nucleotide sequence of about 100 bases, but for a nucleotide sequence of genome DNA of more than a million bases.

Examples of a nucleotide sequence of a prokaryote include the genome sequence of Escherichia coli K-12 strain of 4639221 bp (GenBank accession number: U00096), the genome sequence of Bacillus subtilis strain 168 of 4214814 bp (GenBank accession number: AL009126), the nucleotide sequences of DNA fragments including ribosomal operons of Escherichia coli K-12 strain (GenBank accession numbers: AE000408 and AE000472, DNA fragments of 10944 bp and 14659 bp respectively in length), and the like. It is possible to predict the regions for RNA genes in a nucleotide sequence in advance using a homology search program or a tRNA prediction program such as tRNA-Scan or the like, and to eliminate said regions in the determination of CDSs by replacing the bases of said regions with x and the like.

In the process (a), translation stop codons are searched for from the nucleotide sequence information of the prokaryote which was inputted via the input device, and the provisional translation start codon which yields the longest ORF is set based upon said stop codon.

The longest ORF means, for the i-th stop codon and the (i+1)th stop codon from the 5′ terminal, the region from the ATG codon, the GTG codon, or the TTG codon (termed the provisional start codon) which appears first in the 3′ direction from the i-th stop codon to the (i+1)th stop codon (herein i is an integer greater than or equal to 1). A TAA codon, a TAG codon, or a TGA codon is utilized as the stop codon.

The sets of the stop codon and the provisional translation start codon (the provisional ORFs) are arranged in the order of the position of the stop codon from the 5′ terminal towards the 3′ terminal, and stored in the memory with the positions of the stop codons and the provisional start codons in the above described nucleotide sequence.

ATG, GTG, and TTG are used as the start codon in the present invention unless mentioned specifically.
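The search of process (a) can be sketched as follows for one strand. This is only an illustrative reading of the definition above: the function name, the 0-based coordinates, and the treatment of a start codon that precedes the first stop codon of a reading frame (it is accepted here as well) are assumptions.

```python
START_CODONS = {"ATG", "GTG", "TTG"}
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orfs(seq):
    """Return (start, stop) pairs for the longest ORFs of one strand.

    Both values are 0-based positions of the first base of the codon.
    For each stop codon, the provisional start codon is the first in-frame
    ATG, GTG, or TTG found after the preceding in-frame stop codon.
    """
    seq = seq.upper().replace("U", "T")  # RNA input is handled as DNA
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(seq) - 2, 3):
            codon = seq[pos:pos + 3]
            if codon in STOP_CODONS:
                if start is not None:
                    orfs.append((start, pos))
                start = None  # a new search begins after every stop codon
            elif start is None and codon in START_CODONS:
                start = pos
    # Arrange by stop codon position from the 5' terminal to the 3' terminal.
    return sorted(orfs, key=lambda orf: orf[1])
```

For the minus strand, the same function could be applied to the reverse complement of the sequence, with the coordinates converted back afterwards.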

The process (b) is a process which corresponds to the step S13 of FIG. 13, wherein the CPU predicts the polycistronic transcription units. When a plurality of CDSs, in other words, cistrons, form a polycistron, the translation of these cistrons (CDSs) is often performed by utilizing the same ribosome binding site. In other words, translational coupling between adjacent CDSs may occur. As a condition for translational coupling to occur between a CDS-A and a CDS-B which is present upstream of the CDS-A, it is necessary that the provisional start codon of the longest ORF which includes the CDS-A is upstream of the stop codon of the CDS-B, or at least that said provisional start codon is present in a close position downstream of said stop codon. More desirably, when the start codon of the CDS-A is present upstream of the stop codon of the CDS-B, or is present within D_(S) bases downstream of said stop codon (herein D_(S) is an integer from 20 to 100), it is decided that translational coupling between the CDS-A and the CDS-B can occur, in other words, that the CDS-A and the CDS-B have a possibility to form a polycistronic transcription unit. Accordingly, in this process, it is decided that the ORF-A and the ORF-B have a possibility to form a polycistronic transcription unit if the provisional start codon of the ORF-A is present upstream of the stop codon of the ORF-B (S302), or if it is present within D_(S) bases downstream of said stop codon (S303), and the decision is stored in the memory; otherwise, it is decided that the set of the two adjacent ORFs (ORF-A and ORF-B) does not form a single transcription unit.

Next, the steps of determination of the start codon of the ORF-A [the step S14 of FIG. 13; the processes (c) to (f)] are explained.

Generally, in the expression of genes of a prokaryote, when a start codon of ORF-A is present in the vicinity of the stop codon of ORF-B and a ribosome has finished the translation of ORF-B, the translation of ORF-A is started even if there is no ribosome binding sequence upstream of said start codon. Accordingly, when a start codon is present in the vicinity of the stop codon of ORF-B, it is determined that said start codon is the start codon of CDS-A. In the process (c) of the present invention, two adjacent ORF candidates which were stored in the memory are called up, and it is desirable to search for the start codon in the region within D_(B) bases downstream from the first T residue of the stop codon of ORF-B (herein D_(B) is an integer from 10 to 20) and within U_(B) bases upstream from said T residue (herein U_(B) is an integer from 3 to 15) as the “vicinity of the stop codon” of ORF-B where the translation of ORF-A is started. More desirably the search is performed in the region wherein D_(B) is 14 and U_(B) is 11 (S304) as shown in FIG. 4.
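The decision of process (b) reduces to a comparison of two positions, as the following sketch suggests. The function name, the default D_S = 50, and the choice to measure the distance between the first bases of the two codons in plus-strand coordinates are assumptions for illustration.

```python
def may_form_transcription_unit(start_a, stop_b, d_s=50):
    """Decide whether ORF-A and the upstream ORF-B may form one unit.

    start_a: position of the first base of the provisional start codon of ORF-A.
    stop_b:  position of the first base of the stop codon of ORF-B.
    d_s:     D_S, an integer from 20 to 100 according to the text.
    """
    if not 20 <= d_s <= 100:
        raise ValueError("D_S is expected to be an integer from 20 to 100")
    offset = start_a - stop_b      # negative: start codon lies upstream
    if offset < 0:                 # S302: start of ORF-A upstream of stop of ORF-B
        return True
    return offset <= d_s           # S303: within D_S bases downstream
```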

It is desirable to limit the start codon, which is searched for in this “vicinity of the stop codon” of ORF-B, to ATG or GTG. This is because it is predicted to be difficult to start translation from a TTG codon when a suitable ribosome binding sequence does not exist.

If a single candidate for the translation start codon is obtained, then it is determined that the ORF-A whose translation start codon is said candidate is a true coding region, and that said ORF-A and ORF-B form a single transcription unit, and the results of determination are written into the memory (S312).

On the other hand, if a plurality of start codon candidates is present in the search of S304, then the start codon whose priority is the highest is selected, using the distance between each candidate and the translation stop codon of ORF-B as an indicator of the priority.

If the start codon cannot be determined in the process (c), it is examined, as the process (d), whether a candidate for the start codon at which translation of ORF-A can start is present in “the region around the vicinity of the stop codon” of ORF-B (S305). In the present invention, “the region around the vicinity of the stop codon” of ORF-B is a region within R_(D) bases downstream from the first T (thymidine) residue of said stop codon and within R_(U) bases upstream from said T residue, and desirably said “vicinity of the stop codon” is excluded from the region [herein R_(D) is an integer from 30 to 120, and R_(U) is an integer from 20 to 120].

If a single translation start codon is determined, then it is determined that the ORF-A whose translation start codon is said candidate is a true coding region and that the ORF-A and the ORF-B form a single transcription unit, and the results of this determination are written into the memory.

In the above described processes (c) and (d), if a plurality of candidates for the start codon of the ORF-A is obtained, then the “region around the vicinity of the stop codon” of ORF-B is divided into two, “the region downstream of said vicinity” and “the region upstream of said vicinity”, and it is determined that the translation start codon which has the highest priority is the true translation start codon, according to the priority of the three regions “said vicinity”, “the region downstream of said vicinity”, and “the region upstream of said vicinity”, in that order. Furthermore, when a plurality of translation start codons is present within one region, it is determined that the start codon which has the highest priority is the true start codon of ORF-A, according to a priority ranking determined in advance by using the distance from the translation stop codon of ORF-B as an indicator.

FIG. 4 shows an example in which the priority ranking of a translation start codon which exists in the vicinity of the stop codon of the ORF-B is expressed as the “rank function”. FIG. 4 shows an example in which the region within 14 bases downstream from the first T residue of the stop codon of the ORF-B and within 11 bases upstream from said T residue is set as the “vicinity of the stop codon of the ORF-B”. When the priority ranking is determined, generally, the priority ranking between the above described three regions is determined first, and then the rule is applied that the closer a candidate is to the stop codon of ORF-B, the higher is its priority ranking.

In the present invention, “determining the priority ranking using the distance as an indicator” means applying this rule. Desirably, the priority ranking of the start codon is determined by utilizing, as shown in FIG. 4, a value expressing the positional relationship of the candidate for the start codon of the ORF-A with the stop codon of the ORF-B (in the present specification, termed the “rank function”). The value of the “rank function” is shown in FIG. 4, and the lower this value is, the higher is the priority. In each of the regions “the vicinity of the stop codon of the ORF-B”, “the region downstream of said vicinity”, and “the region upstream of said vicinity”, the closer the start codon of the ORF-A is to the stop codon of the ORF-B, the smaller is the value of the rank function.

Specifically, in FIG. 4, the “vicinity of the stop codon of the ORF-B”, whose priority is the highest, is the region within 14 bases downstream from the first T residue of the stop codon of the ORF-B and within 11 bases upstream from said T residue, and the value of the rank function of this region is a value from 1 to 8, as shown in FIG. 4.
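The ordering described above (region first, then closeness to the stop codon of ORF-B) can be sketched as below. The actual rank function values are those given in FIG. 4 and are not reproduced here; this sketch substitutes a plain (region, distance) ordering. The function names and the defaults R_D = 60 and R_U = 60 are assumptions, and the restriction to ATG or GTG in the vicinity and the ribosome binding check outside it are left out.

```python
def classify_region(offset, d_b=14, u_b=11, r_d=60, r_u=60):
    """Classify a candidate by its offset from the first T of the ORF-B stop codon.

    offset > 0 means downstream, offset < 0 means upstream.
    Returns 0 for the vicinity, 1 for the region downstream of the vicinity,
    2 for the region upstream of the vicinity, and None outside all regions.
    """
    if -u_b <= offset <= d_b:
        return 0
    if d_b < offset <= r_d:
        return 1
    if -r_u <= offset < -u_b:
        return 2
    return None


def pick_start_codon(candidate_offsets, **regions):
    """Pick the candidate with the highest priority, or None if none qualifies."""
    ranked = []
    for offset in candidate_offsets:
        region = classify_region(offset, **regions)
        if region is not None:
            # Lower tuple = higher priority: region first, then distance.
            ranked.append((region, abs(offset), offset))
    return min(ranked)[2] if ranked else None
```

A lookup table built from FIG. 4 could replace the tuple used as the sort key without changing the rest of the sketch.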

As described above, translation starts if the start codon of the ORF-A is present in the “region around the vicinity of the stop codon” of the ORF-B. Since in this case the start codon of the ORF-A is far from the stop codon of the ORF-B, it is desirable, in order to start translation at high efficiency, that a ribosome binding sequence is present upstream of the start codon of the ORF-A. In other words, in order to determine the candidate for the translation start codon of the ORF-A as the translation start codon, it is desirable that a ribosome binding sequence which enables a ribosome and mRNA to bind is present in the sequence upstream of said candidate.

A method of investigating a ribosome binding sequence includes a method which comprises obtaining a score which expresses the binding state between a ribosome and mRNA, by expressing the paired state between an mRNA sequence present in the region of 1 to 30 bases upstream of said candidate and an mRNA binding sequence within the 16S rRNA 3′ terminal sequence as a numerical value (S306), and determining said candidate for the start codon as the start codon of ORF-A (S307) if the score exceeds the threshold value, or the like.

As the mRNA binding sequence within the 16S rRNA 3′ terminal sequence, many bacteria and Archaea have the sequence 3′-UUCCUCC-5′ or 3′-UCCUCC-5′. Accordingly, when determining the start codon of a coding region of a prokaryotic cell whose 16S rRNA 3′ terminal sequence is not known, it is also desirable to utilize 3′-UUCCUCC-5′ or 3′-UCCUCC-5′ as an mRNA binding sequence within the 16S rRNA 3′ terminal sequence. It is also possible to utilize a sequence of 6 to 13 bases which includes a sequence (normally 1 to 3 bases) of a region around this sequence as an mRNA binding sequence. Furthermore, when determining the start codon of a coding region of a prokaryotic cell for which the 16S rRNA 3′ terminal sequence is known, it is also possible to utilize said 16S rRNA 3′ terminal sequence.

For starting translation at high efficiency, it is necessary for a sequence termed the SD sequence, which participates in the binding with the mRNA binding sequence, to be present in an appropriate position for the start of translation upstream of the start codon. Therefore, it is investigated whether a sequence which pairs with the 16S rRNA 3′ terminal sequence is present in a region of 1 to 30 bases upstream of the start codon.

Specifically, this investigation is performed by a method of calculating a score which shows the binding state between the ribosome and the mRNA by expressing, as a numerical value, the paired state between a 3′-UUCCUCC-5′ mRNA binding sequence within the 16S rRNA 3′ terminal sequence and an mRNA sequence of 4 to 17 bases upstream of said start codon, or the paired state between a 3′-UCCUCC-5′ mRNA binding sequence within the 16S rRNA 3′ terminal sequence and an mRNA sequence of 4 to 16 bases upstream of said start codon. As a means of calculating the score which shows the paired state between the ribosome and the mRNA, any method can be used, provided that it is a method of expressing as a numerical value the base pairing between an mRNA sequence which is in a region of 1 to 30 bases upstream of the start codon of the coding region and an mRNA binding sequence within the 16S rRNA 3′ terminal sequence. Said method includes, for example, a method using the rule of dissociation temperature of a nucleotide hybrid, a method using the value of free energy which shows the binding state of a nucleotide hybrid [Schurr, T. et al.: Nucleic Acids Research Vol. 21, 4019-4023 (1993)], a method of searching for a ribosome binding site utilizing a weight matrix [Frishman, D. et al.: Nucleic Acids Research Vol. 26, 2941-2947 (1998)], or the like.

When a score which shows the binding state between the ribosome and the mRNA is obtained by using the above described method, it is possible to determine that a sequence which has a score which exceeds a fixed threshold value which was set by various methods is the ribosome binding sequence.

In the following, a method of calculating the score which shows the binding state between the ribosome and the mRNA using the rule of the dissociation temperature of base-paired nucleotides will be explained.

The dissociation temperature of a hybrid of a DNA and an oligonucleotide is often calculated by defining pairing of G and C as 4° C., pairing of A and T as 2° C., and pairing of G and T as 1° C., and it is possible to apply the same calculation method to a method using the rule of dissociation temperatures of a nucleotide hybrid. The score (hereinafter referred to as the “ribosome binding score”) is obtained based upon the pairing between the mRNA binding sequence within the 3′ terminal sequence of the 16S rRNA and the nucleotide sequence upstream of the start codon, for example, by defining pairing of G and C as +4, pairing of A and U as +2, and pairing of G and U as +1. Furthermore, it is also desirable to allocate a penalty when no pairing occurs between the bases next to the bases between which pairing was observed. As a penalty score, for example, −1 may be utilized. Accordingly, as a method for calculation of the score utilizing the rule of dissociation temperature of a hybrid of nucleic acids, a method is utilized of converting the paired state into a numerical value according to the following four rules:

(1) A pairing of G and C yields +4;

(2) A pairing of A and U yields +2;

(3) A pairing of G and U yields +1;

(4) When no pairing is recognized at a base pair which is adjacent to a base pair for which a pairing has been recognized, then this yields −1.

Normally, the “ribosome binding score” is calculated for the 8 distances between the SD sequence and the start codon, each of which is 5 to 12 base pairs, and, after obtaining the maximum value, it is also desirable to utilize the value in the decision of the truth or falsity of the CDS.
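The four rules above can be sketched in Python as follows. The alignment is an assumption made for illustration: the mRNA window is written 5′ to 3′ and compared position by position with the mRNA binding sequence written 3′ to 5′ (3′-UCCUCC-5′ by default), and rule (4) is applied once to each unpaired position of the window that touches a paired position. The function names are hypothetical, and the scan over windows does not reproduce the exact distance bookkeeping of FIG. 5.

```python
PAIR_SCORE = {
    ("G", "C"): 4, ("C", "G"): 4,   # rule (1)
    ("A", "U"): 2, ("U", "A"): 2,   # rule (2)
    ("G", "U"): 1, ("U", "G"): 1,   # rule (3)
}
PENALTY = -1                        # rule (4)


def pairing_score(mrna_window, anti_sd_3to5="UCCUCC"):
    """Score one alignment of an mRNA window (5'->3') against the rRNA (3'->5')."""
    mrna = mrna_window.upper().replace("T", "U")
    n = min(len(mrna), len(anti_sd_3to5))
    paired = [(mrna[k], anti_sd_3to5[k]) in PAIR_SCORE for k in range(n)]
    score = 0
    for k in range(n):
        if paired[k]:
            score += PAIR_SCORE[(mrna[k], anti_sd_3to5[k])]
        elif (k > 0 and paired[k - 1]) or (k + 1 < n and paired[k + 1]):
            score += PENALTY  # unpaired position adjacent to a paired one
    return score


def ribosome_binding_score(upstream, anti_sd_3to5="UCCUCC"):
    """Maximum pairing score over all windows of the upstream sequence."""
    width = len(anti_sd_3to5)
    scores = [pairing_score(upstream[i:i + width], anti_sd_3to5)
              for i in range(len(upstream) - width + 1)]
    return max(scores, default=0)
```

Under these assumptions an AGGA stretch followed by non-pairing bases scores 2 + 4 + 4 + 2 − 1 = 11 and an AGG stretch scores 2 + 4 + 4 − 1 = 9.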

It is known that the better the paired state between the mRNA sequence and the mRNA binding sequence within the 16S rRNA 3′ terminal sequence, the higher the efficiency of translation initiation. It can therefore be predicted that the greater the value of the ribosome binding score, the higher the efficiency of translation initiation, and the position of the start codon can be determined from the magnitude of the score. As a method of determining the start codon, it is possible to obtain the ribosome binding score for each start codon candidate and to select a candidate whose score exceeds some threshold value V₁ as the start codon. It is known that translation is initiated at high efficiency when the SD sequence is AGGA or AGG, and the ribosome binding scores of these sequences are 11 and 9 respectively. From this knowledge, when the rule of dissociation temperature of a nucleotide hybrid is used in the calculation of the ribosome binding score, the threshold value V₁ is set to an integer from 7 to 14. It is known that many start codons of CDSs are present in the 5′ terminal region of the longest ORF. Accordingly, the ribosome binding score is calculated for N candidates for the start codon (wherein N is an integer from 5 to 10) which are found from the 5′ terminal of the longest ORF. Moreover, since the closer a candidate is to the 5′ terminal, the higher the possibility that it is the true start codon, the ribosome binding score is compared with the threshold value V₁ in order from the candidate closest to the 5′ terminal, and the first candidate whose score exceeds V₁ can be determined as the start codon. Since translation is initiated even if the SD sequence is AAG or GG, it is also desirable, if the start codon cannot be determined with the threshold value V₁, to determine the start codon again by using a plurality of threshold values smaller than V₁ in a stepwise manner.
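By way of illustration only, the stepwise threshold selection described above may be sketched as follows. The function name `select_start_codon` and the default thresholds (9, 7, 5) are assumptions chosen from within the ranges given in the text; they are not values prescribed by the method.

```python
from typing import Optional, Sequence, Tuple


def select_start_codon(
    scores: Sequence[int],
    thresholds: Sequence[int] = (9, 7, 5),  # assumed example values for V1 and smaller thresholds
) -> Optional[Tuple[int, int]]:
    """Return (candidate_index, threshold_used), or None if no candidate qualifies.

    scores[i] is the ribosome binding score of the i-th start codon candidate,
    ordered from the 5' terminal of the longest ORF.
    """
    # Try the largest threshold first, then fall back stepwise to smaller ones.
    for threshold in sorted(thresholds, reverse=True):
        # Within one threshold, the candidate closest to the 5' terminal wins.
        for index, score in enumerate(scores):
            if score > threshold:
                return index, threshold
    return None
```

Under these assumptions, a candidate list is scanned completely at V₁ before any smaller threshold is tried, so a strong downstream candidate is preferred over a weak upstream one.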

However, in the previously described process (d), if a candidate for the translation start codon is present in a region around the “vicinity of the translation stop codon” of ORF-B, then, since the ribosome which terminated the translation of ORF-B initiates the translation of ORF-A, the threshold value used for selecting the start codon may be a value V₃ which is smaller than V₁. As the value of V₃, an integer between 4 and 12 is desirable.

Generally, it is considered that the efficiency of translation initiation is highest for the start codon ATG, followed by GTG and then TTG. Accordingly, it is desirable to correct the ribosome binding score so as to reflect the differences between the start codons. As the means for this correction, for example, when calculating the ribosome binding score, it is possible to deduct a numerical value P_(G) when the start codon is GTG, and to deduct a numerical value P_(T) when the start codon is TTG. When using a method which uses the rule of dissociation temperature of a nucleotide hybrid in the calculation for obtaining the ribosome binding score, it is possible to utilize an integer between 1 and 4 as P_(G) and an integer between 2 and 6 as P_(T); more preferably, 2 is utilized as P_(G) and 4 as P_(T).

FIG. 5 shows, as a flow chart, an example of a method of calculating the ribosome binding score.

First, one of the start codon candidates which have already been found (ATG, GTG, TTG) is selected (S501), and then the distance d between the SD sequence and the start codon is set to 12 (S502). This means that the score upstream of the start codon candidate whose paired state is being investigated is set to be calculated for the 8 bases of the region from the 17th base to the 10th base upstream.

In S503, the score for d=12 is first calculated according to the setting in S502. In other words, the score for the 8 bases of the region from the 17th base to the 10th base upstream of the start codon whose paired state is being investigated is calculated by a predetermined method; then, when it is found in S504 that the start codon is GUG or UUG, a penalty score is allocated in S505.

Then the steps from S503 to S505 are repeated while reducing the setting of d by one at a time, so as to obtain the score for the region (d=11) from the 16th base to the 9th base upstream of the start codon candidate, the score for the region (d=10) from the 15th base to the 8th base upstream, the score for the region (d=9) from the 14th base to the 7th base upstream, the score for the region (d=8) from the 13th base to the 6th base upstream, the score for the region (d=7) from the 12th base to the 5th base upstream, the score for the region (d=6) from the 11th base to the 4th base upstream, and the score for the region (d=5) from the 10th base to the 3rd base upstream; the maximum value among these is defined as the ribosome binding score of the start codon candidate.
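As a non-limiting sketch of the loop over d described above, the following code extracts the 8-base region for each distance from 12 down to 5, applies the start codon penalty, and keeps the maximum. The names `upstream_window` and `ribosome_binding_score` are hypothetical, the window scorer is passed in as a parameter because the "predetermined method" is not fixed here, and the penalties 2 and 4 are the preferred values of P_(G) and P_(T) stated in the text.

```python
from typing import Callable, Optional

# Preferred penalty values from the text: P_G = 2 for GTG, P_T = 4 for TTG.
PENALTY = {"ATG": 0, "GTG": 2, "TTG": 4}


def upstream_window(seq: str, start: int, d: int) -> str:
    """8 bases from the (d+5)-th to the (d-2)-th base upstream of the start codon.

    `start` is the 0-based index of the first base of the start codon, so the
    n-th upstream base is seq[start - n]. Returns "" if the sequence is too short.
    """
    left = start - (d + 5)
    if left < 0:
        return ""
    return seq[left:left + 8]


def ribosome_binding_score(
    seq: str, start: int, score_window: Callable[[str], int]
) -> Optional[int]:
    """Maximum penalised window score over d = 12 .. 5, or None if no window fits."""
    codon = seq[start:start + 3]
    best: Optional[int] = None
    for d in range(12, 4, -1):
        window = upstream_window(seq, start, d)
        if len(window) < 8:
            continue  # not enough upstream sequence for this distance
        score = score_window(window) - PENALTY.get(codon, 0)
        if best is None or score > best:
            best = score
    return best
```

With d=12 and the start codon at index 20, for example, the window is `seq[3:11]`, that is, the 17th through the 10th upstream base.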

Thus, the steps S501 to S506 are repeated for all the start codon candidates, and a ribosome binding score is obtained for each of them.

As a specific example of the calculation of the ribosome binding score, as shown in FIG. 6, the ribosome binding score of the trpL gene, which is the first gene in the Escherichia coli tryptophan operon, is calculated as 12.

As described above, the ribosome binding score obtained in this manner can be used for selecting, as the start codon, the candidate whose score exceeds a threshold value V₁.

The ORF-A whose translation start codon is the codon selected in this manner is determined as a true coding region, it is determined that said ORF-A and the ORF-B form a single transcription unit, and the results of determination are written into the memory.

For an ORF-A which was not decided in process (b) to have a possibility of forming a polycistronic transcription unit, or for an ORF-A whose start codon could not be determined in process (e), the truth or falsity of each candidate as a start codon is decided by process (f) for the start codon candidates, including the provisional start codon which yields the longest ORF, which are present in the 5′ terminal region of the longest ORF. The reason is that, when the probability that further start codons and stop codons appear upstream of the correct start codon is estimated, the probability that a plurality of ATG, GTG, or TTG codons appear without a stop codon appearing is low. Generally, the ribosome binding scores of N start codon candidates from the 5′ terminal (where N is an integer from 5 to 20) are obtained by the above described method or the like (S309) and compared with the threshold value V₁ (which is greater than the threshold value V₃), and the first candidate whose score exceeds V₁ can be determined as the start codon (S310). However, since translation is initiated even when the SD sequence is AAG or GG, it is also desirable, when the start codon cannot be determined with the threshold value V₁, to determine the start codon again (S311) by using a plurality of threshold values less than V₁ in a stepwise manner. More desirably, in addition to the threshold value V₁, it is possible to use a threshold value V₂ which is less than V₁, and a threshold value V₃ which is less than V₂. In this case, it is desirable to utilize an integer from 5 to 13 as the value of V₂, and an integer from 4 to 12 as the value of V₃.

The ORF-A whose translation start codon is the selected candidate which corresponds to said ribosome binding sequence is determined as a true coding region, and the results of determination are written into the memory (S312).

Here, it is desirable to investigate the presence of a ribosome binding site at a plurality of threshold values smaller than V₁ and, when it is still not possible to determine a translation start codon, not to take ORF-A as including a true coding region (S313).

By process (g), from the results of determination of the process (c), (e), or (f), the positions of the start codon and the stop codon, the coding region, and the transcription unit are confirmed, and the genetic structure is determined (S312).

It is possible to output the positions of the start codon and the stop codon, and the information related to the CDSs included in the various transcription units, by displaying them upon the display, by printing them on a printer, or by recording them upon a recording medium. In the case of a start codon which is upon the plus strand, the position of the start codon can be given by the position of the first base of the 3 bases of the codon, while, in the case of a start codon which is upon the minus strand, it can be given by the position of the first base or by the position of the last base of the 3 bases of the codon. In the case of a stop codon which is upon the plus strand, the position of the stop codon can be given by the position of the first base or by the position of the last base of the 3 bases of the codon, while, in the case of a stop codon which is upon the minus strand, it can be given by the position of the first base of the 3 bases of the codon.

In relation to the structure of the transcription unit, it is necessary to specify whether the transcription unit which has been determined is a monocistron or a polycistron. Furthermore, in the case of a polycistron, it is also desirable to output information which distinguishes between the first CDS, the last CDS, and the CDSs which are present internally between the first CDS and the last CDS. As an ideal example of a method of output in relation to the information about the structure of a transcription unit, it is possible to allocate the label “1” to a CDS of a monocistron, and to allocate the labels “2”, “4”, and “3” respectively to the first CDS, to the last CDS, and to the CDSs which are present internally between the first CDS and the last CDS.
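The labelling scheme just described may be illustrated by the following minimal sketch, in which the function name `label_transcription_unit` is a hypothetical one and the CDSs of one transcription unit are assumed to be given in 5′ to 3′ order.

```python
from typing import List


def label_transcription_unit(cds_count: int) -> List[int]:
    """Labels for the CDSs of one transcription unit, in 5'-to-3' order.

    1 = CDS of a monocistron, 2 = first CDS, 3 = internal CDS, 4 = last CDS.
    """
    if cds_count < 1:
        raise ValueError("a transcription unit contains at least one CDS")
    if cds_count == 1:
        return [1]
    return [2] + [3] * (cds_count - 2) + [4]
```

For instance, a polycistron of four CDSs would be labelled 2, 3, 3, 4.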

Apart from the above described “method of determining coding regions aimed at transcription unit structure”, it is also possible to determine the genetic structure by utilizing a variant of this method. This variant is a method of determining a genetic structure for a prokaryote which includes the processes (a1) through (g3) described below.

(a1): From the nucleotide sequence information of the prokaryote, a translation stop codon is set, and a provisional translation start codon which yields the longest ORF is set based upon said translation stop codon.

(a2): After selection, from the ORFs which have been obtained by the process (a1), of those for which the length of the ORF (the positional difference between the position of the first translation stop codon and the position of the first translation start codon) is greater than or equal to L_(O) bases, the names the first ORF, the second ORF, . . . the K-th ORF are assigned to the K ORFs which have been chosen, in order from the 5′ terminal side according to the positions of their translation stop codons [here L_(O) is an integer from 30 to 900, and K is an integer greater than or equal to 2].

(b1): When the provisional translation start codon of the I-th ORF which has been obtained by the process (a2) either is upstream of the translation stop codon of the J-th ORF, which includes a coding region decided to be “a true coding region” by the process (g3) among the ORFs from the first ORF to the (I-1)-th ORF, or is within D_(S) bases downstream of said translation stop codon, then it is decided that there is a possibility that the I-th ORF and the J-th ORF may create a polycistronic transcription unit [here, I is an integer from 1 to K, J is a positive integer less than I, and D_(S) is an integer from 20 to 100].

(c1): Among the J-th ORFs for which it has been decided in the process (b1) that there is a possibility of creating a polycistronic transcription unit with the I-th ORF, the ORF for which the value of J is the minimum is chosen, and, when there is a candidate for the translation start codon in the “vicinity of the translation stop codon” of the J-th ORF, in other words, in “the region within D_(B) bases in the downstream direction from the first T residue of said stop codon and within U_(B) bases in the upstream direction from said T residue”, then this candidate is determined as being the translation start codon of the I-th ORF [here, D_(B) is an integer from 10 to 20, and U_(B) is an integer from 3 to 15]. Here, if there are a plurality of candidates for the translation start codon, then a priority ranking is determined from the shorter ones, with the distance between each candidate and the translation stop codon of the J-th ORF taken as an indicator, and thereby the translation start codon of the I-th ORF is determined.

(d1): If it has not been possible to determine the translation start codon of the I-th ORF in the process (c1), then it is investigated whether or not a candidate for the translation start codon of the I-th ORF is present in a position in which it is possible to restart translation within the “coding region around the vicinity of the stop codon” of the J-th ORF.

(e1): When in the process (d1) a candidate for the translation start codon of the I-th ORF is present in said region, then, when the paired state between the mRNA sequence of 4 to 17 bases upstream of said candidate and the sequence (3′-UUCCUCC-5′) within the 16S rRNA 3′ terminal sequence which is involved in the binding with the mRNA, or between the mRNA sequence of 4 to 16 bases upstream of said translation start codon and the sequence (3′-UCCUCC-5′) within the 16S rRNA 3′ terminal sequence which is involved in the binding with the mRNA, is expressed as a numerical value according to the four rules described below:

(1) A pairing of G and C yields +4;

(2) A pairing of A and U yields +2;

(3) A pairing of G and U yields +1;

(4) When no pairing is recognized at a base pair which is adjacent to a base pair for which a pairing has been recognized, then this yields −1,

this numerical value is taken as a “score which shows the state of binding between the mRNA and the ribosome (the ribosome binding score)”, and, when said score exceeds a threshold value V₃, said candidate is determined as being the translation start codon of the I-th ORF. Here, if there is a plurality of candidates for the translation start codon, the “coding region around the vicinity of the stop codon” of the J-th ORF is divided into the two portions “the region downstream of said vicinity” and “the region upstream of said vicinity”, and the preference for translation initiation is determined in the order “the region downstream of said vicinity” and then “the region upstream of said vicinity”; furthermore, if there is a plurality of translation start codon candidates within each of these regions, then the candidates are selected stepwise by determining a priority ranking from the shorter ones, with the distance from the translation stop codon of the J-th ORF being taken as an indicator, and the one which exceeds the threshold value V₃ is determined as being the translation start codon of the I-th ORF.

(e2): If it has not been possible to determine the translation start codon of the I-th ORF in the process (d1) or in the process (e1), then the procedure returns to the process (c1), and, among the J-th ORFs for which it has been decided in the process (b1) that there is a possibility of creating a polycistronic transcription unit with the I-th ORF, the ORF for which the value of J is the next smallest is chosen, and the work of determining the translation start codon of the I-th ORF by the processes (c1) through (e1) is repeated until the length of the I-th ORF becomes less than the length L_(O) [the L_(O) here is the same value as the L_(O) which was shown in the process (a2)].

(f1): For the I-th ORF for which it has not been decided in the process (b1) that there is a possibility of forming a polycistronic transcription unit, or for the I-th ORF for which it has not been possible to determine the translation start codon during the processes (c1) through (e2), candidates for the translation start codon are searched for from the 5′ terminal up to at most N candidates, including the provisional start codon which yields the longest ORF, a “ribosome binding score” is obtained for each of the candidates in the same way as in the process (e1), and the one for which said score exceeds the threshold value V₁ is determined as being the translation start codon of the I-th ORF. Furthermore, if there is no candidate whose score exceeds the threshold value V₁, then one or more threshold values which are smaller than V₁ and which include the threshold value V₃ are set, and the translation start codon of the I-th ORF is determined in a stepwise manner as the candidate whose score exceeds such a threshold value [here, V₁ is an integer from 7 to 14 which is greater than the V₃ of the process (e1), and N is an integer from 5 to 20].

(g1): The I-th ORF for which it has not been possible to determine the translation start codon in the processes from the process (c1) through the process (f1) is decided to be a “false ORF”.

(g2): The I-th ORF for which the process (c1), the process (d1), the process (e1), or the process (f1) has determined the translation start codon of its coding region is decided to include a “true ORF”, and the positions of its translation start codon and its translation stop codon, its coding region, and its transcription units are confirmed, thus determining its genetic structure.

(g3): For all of the K ORFs from the first ORF to the K-th ORF, “true” or “false” of the coding regions is decided by the methods of the process (g1) and the process (g2), and, for all of the coding regions for which “true” has been decided, the positions of their translation start codons and their translation stop codons, their coding regions, and their transcription units are confirmed, thus determining their genetic structures [here, K is an integer greater than or equal to 2].
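For illustration only, the four pairing rules of the process (e1) may be sketched as below. This is one possible reading, made under stated assumptions: the mRNA stretch and the rRNA sequence are aligned without gaps and have equal length, the rRNA sequence is written 3′ to 5′ so that position i of the mRNA faces position i of the rRNA, and rule (4) is applied once to every unpaired position that has a paired neighbour. The function name `pairing_score` is hypothetical.

```python
# Rules (1) to (3): G-C pairs score +4, A-U pairs +2, G-U pairs +1.
PAIR_SCORE = {
    ("G", "C"): 4, ("C", "G"): 4,
    ("A", "U"): 2, ("U", "A"): 2,
    ("G", "U"): 1, ("U", "G"): 1,
}


def pairing_score(mrna: str, rrna_3to5: str) -> int:
    """Score an ungapped alignment of an mRNA stretch (5'->3') against rRNA (3'->5')."""
    if len(mrna) != len(rrna_3to5):
        raise ValueError("this sketch assumes equal-length, ungapped alignment")
    mrna = mrna.upper().replace("T", "U")  # accept DNA-style input
    per_base = [PAIR_SCORE.get((m, r), 0) for m, r in zip(mrna, rrna_3to5)]
    total = sum(per_base)
    # Rule (4): -1 for each unpaired position adjacent to a paired position.
    for i, score in enumerate(per_base):
        if score == 0:
            left_paired = i > 0 and per_base[i - 1] > 0
            right_paired = i + 1 < len(per_base) and per_base[i + 1] > 0
            if left_paired or right_paired:
                total -= 1
    return total
```

Under this reading, AGGA and AGG aligned at the 3′ end of 3′-UCCUCC-5′ and followed by non-pairing bases give 11 and 9, the scores quoted earlier for these SD sequences; other alignments or other readings of rule (4) would give different values.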

In the following, a variant of the above described “method of determining coding regions aimed at transcription unit structure”, which is to be performed with a computer, will be explained in detail with reference to the flow charts of FIG. 7 and FIG. 8.

First, in the process (a1), from the nucleotide sequence information of a prokaryote which has been inputted by the input device, the CPU sets a translation stop codon, and sets the provisional translation start codon which yields the longest ORF based upon said translation stop codon (S701). The definitions of the translation stop codon, of the translation start codon, and of the longest ORF are the same as those described above.

In the process (a2), since the possibility is high, among the plurality of ORFs which have been obtained by the process (a1), that the shortest ORFs are not true coding regions, the CPU calculates the length of each of the ORFs, and those ORFs which are greater than or equal to a fixed length are selected. Although it is possible to use from 30 to 900 bases as the length which is utilized for this selection, from 30 to 600 bases is desirable. For the ORFs which have been selected in this manner, it is desirable, in order to make the work in the subsequent processes convenient, to assign the names the first ORF, the second ORF, . . . the K-th ORF in order from the 5′ terminal side, based upon the positions of the stop codons of each ORF [here, K is an integer greater than or equal to 2]. This data is stored in the memory (S702).
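The processes (a1) and (a2) may be pictured with the following minimal sketch for the plus strand only. It assumes the stop codons TAA, TAG, and TGA and the start codon candidates ATG, GTG, and TTG; the function name `longest_orfs` and the default minimum length of 90 bases are illustrative choices within the stated range, not values fixed by the method.

```python
from typing import List, Tuple

STOPS = {"TAA", "TAG", "TGA"}    # assumed stop codon set
STARTS = {"ATG", "GTG", "TTG"}   # start codon candidates named in the text


def longest_orfs(seq: str, min_length: int = 90) -> List[Tuple[int, int]]:
    """Return (start, stop) positions of the longest ORF ending at each stop codon.

    Positions are 0-based indices of the first base of each codon. The ORF length
    is taken as the positional difference stop - start, and ORFs shorter than
    min_length are dropped. The result is ordered by stop codon position.
    """
    orfs = []
    for frame in range(3):
        first_start = None  # most upstream start candidate since the previous stop
        for pos in range(frame, len(seq) - 2, 3):
            codon = seq[pos:pos + 3]
            if codon in STOPS:
                if first_start is not None and pos - first_start >= min_length:
                    orfs.append((first_start, pos))
                first_start = None
            elif codon in STARTS and first_start is None:
                first_start = pos  # provisional start codon yielding the longest ORF
    return sorted(orfs, key=lambda orf: orf[1])
```

The minus strand could be handled in the same way by applying the function to the reverse complement of the sequence and converting the coordinates back.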

In the process (b1), when the provisional translation start codon of the I-th ORF which has been obtained by the process (a2) is upstream of the translation stop codon of the J-th ORF among the ORFs from the first ORF to the (I-1)-th ORF (S704), or is within D_(S) bases downstream of said translation stop codon (S705), the CPU decides (S706) that there is a possibility of the I-th ORF and the J-th ORF creating a polycistronic transcription unit [here I is an integer from 1 to K, and J is a positive integer smaller than I, while D_(S) is an integer from 20 to 100]. Since it may also happen that the J-th ORF is an ORF which has been decided to be a “false ORF” by the process (g3), in this case the CPU decides that the J-th ORF does not create a polycistron with the I-th ORF (S703). If a plurality of ORFs for which there is a possibility of creating a polycistron with the I-th ORF has been obtained upstream of the I-th ORF, then it is also desirable for this plurality of ORFs to be selected, sorted in ascending order of their numbers, and stored in the memory.
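A minimal sketch of the decision of the process (b1) on the plus strand is given below. It assumes that each ORF is a (start, stop) pair of codon positions ordered by stop position, that `true_flags[j]` records whether the J-th ORF has not been decided to be a “false ORF”, and that "within D_(S) bases downstream" is measured from the stop codon position; the function name `polycistron_partners` and the default D_(S) of 50 are hypothetical.

```python
from typing import List, Sequence, Tuple


def polycistron_partners(
    orfs: Sequence[Tuple[int, int]],
    true_flags: Sequence[bool],
    i: int,
    d_s: int = 50,  # assumed example value within the stated range of 20 to 100
) -> List[int]:
    """Indices j < i of upstream ORFs that may form a polycistron with ORF i."""
    start_i, _ = orfs[i]
    partners = []
    for j in range(i):
        if not true_flags[j]:
            continue  # a "false ORF" does not create a polycistron
        _, stop_j = orfs[j]
        # Start of ORF i is upstream of the stop of ORF j, or at most d_s bases downstream.
        if start_i <= stop_j + d_s:
            partners.append(j)
    return partners  # already in ascending order of j
```

Because the loop runs over j in ascending order, the returned list is already arranged in the order in which the partner ORFs are to be tried.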

The process (c1) is the same as the previously described process (c) (S708).

The process (d1) is the same as the previously described process (d) (S709).

The process (e1) is the same as the previously described process (e) (S801, S802). Furthermore, in this process, if the length of the coding region which is computed from the position of the start codon which has been determined by this process is extremely short, then it is desirable to decide that it is a “false coding region” and to discard it. For the length which is utilized for this decision, it is desirable to utilize the length which was utilized in the previously described process (a2).

In the process (e2), if it has not been possible to determine the translation start codon of the I-th ORF in the process (d1) or in the process (e1), the procedure returns to the process (c1), and the CPU searches for another ORF for which there is a possibility of creating a polycistronic transcription unit with the I-th ORF. In other words, after having chosen the one, among the J-th ORFs for which it has been decided in the process (b1) that there is a possibility of creating a polycistronic transcription unit with the I-th ORF, for which the value of J is the next smallest (S803, S804), the CPU performs the job of determining the translation start codon of the I-th ORF by the processes from the process (c1) through the process (e1). It is also desirable to stop the repetition from the process (c1) through the process (e1) at the time point that the length of the I-th ORF has become small. For this length, it is desirable to utilize the length which was utilized in the previously described process (a2).

In the process (f1), for the I-th ORF for which it was not decided in the process (b1) that there is a possibility of creating a polycistronic transcription unit, or for the I-th ORF for which it was not possible to determine the translation start codon by the processes from the process (c1) through the process (e2), a translation start codon is found by the CPU obtaining a “ribosome binding score”; this method is the same as the previously described process (f) (S805-S808).

In the process (g1), the I-th ORF for which the CPU has not been able to determine the translation start codon by the processes from the process (c1) through the process (f1) is decided to be a “false ORF” (S809).

In the process (g2), the I-th ORF for which the translation start codon of its coding region has been determined by the process (c1), the process (d1), the process (e1), or the process (f1) is decided as including a “true” coding region (S810). Along with confirming the position of the translation stop codon and the coding region by confirmation of the position of the translation start codon, when in the processes (b1) through (e2) there has appeared a possibility that the I-th ORF may create a polycistron with an ORF upstream thereof, it is possible to confirm that the I-th ORF is a gene which participates in polycistron creation.

In the process (g3), the CPU decides upon the “truth” or the “falsity” of the coding regions for all the K ORFs from the first ORF to the K-th ORF [here K is an integer greater than or equal to 2] by the methods of the process (g1) and the process (g2), and accordingly is able to determine the genetic structure by confirming the positions of the translation start codons and the translation stop codons of all the coding regions which have been decided to be “true coding regions”, and also the coding regions and transcription units. Finally, the determination results which have been obtained are outputted via the output device (S811).

Using the above described method of determining a genetic structure, after the structure of the transcription units has been predicted and the candidates for CDSs have been selected from the plus strand and the minus strand of the DNA of a prokaryotic cell, it is possible to enhance the accuracy of determination of the CDSs by deciding upon the truth or the falsity of CDSs or transcription units which mutually overlap one another, by the following method. That is, the positional relationship of the plurality of CDSs and transcription units which have been selected is investigated, and, if another transcription unit Q or another CDS-B which is present upon the same strand is included in a transcription unit P or a CDS-A, it is possible to decide that the included one is a “false transcription unit” or a “false CDS”. Furthermore, if another transcription unit Q or another CDS-B which is present upon the complementary strand is included in the transcription unit P or the CDS-A, it is likewise possible to decide that the included one is a “false transcription unit” or a “false CDS”. When the transcription unit P or the CDS-A overlaps with the other transcription unit Q or the other CDS-B which is present upon the complementary strand, it is possible to decide that the shorter of the two transcription units or CDSs is a “false transcription unit” or a “false CDS”.

Furthermore, even when CDSs or transcription units of the plus strand, or CDSs or transcription units of the minus strand, overlap without one including the other, it is possible to enhance the accuracy of CDS or transcription unit determination by comparing their lengths and supposing that the shorter of them is a “false CDS” or a “false transcription unit”.

It is known that, with two coding regions which are adjacent upon the same strand and which meet one another, it sometimes happens that a portion of the 3′ terminal side of the upstream coding region and a portion of the 5′ terminal side of the downstream coding region mutually overlap. Accordingly, the mutual overlapping of the coding regions is investigated, and, when such an overlapping has been observed, it will be acceptable to take both of the coding regions as true coding regions. It should be understood that it is normally desirable for the length of this overlap to be 10% or less of the length of the shorter of the coding regions. Furthermore, it is also known that, for a coding region upon a plus strand and a coding region upon a minus strand, it sometimes happens that portions of their respective 3′ terminal sides mutually overlap. Therefore the mutual overlapping between the coding regions is investigated, and, when this type of overlap has been observed, it will be acceptable to take both of the coding regions as true coding regions. It should be understood that it is normally desirable for the length of this overlap to be 10% or less of the length of the shorter of the coding regions.
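The overlap relationships discussed above may be sketched, purely as an illustration, by the following code. Coding regions are represented as half-open (begin, end) intervals on a common coordinate axis, and the category names returned by the hypothetical function `classify_pair` are labels invented for this sketch; which region is then treated as false follows the rules of the text and is not decided here.

```python
from typing import Tuple

Region = Tuple[int, int]  # half-open interval (begin, end), begin < end


def overlap_length(a: Region, b: Region) -> int:
    """Number of positions shared by the two regions."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def classify_pair(a: Region, b: Region, max_fraction: float = 0.10) -> str:
    """Classify two regions as 'separate', 'included', 'tolerated', or 'excessive'.

    'tolerated' means the overlap is at most max_fraction (10% in the text) of
    the shorter region, so both regions may be kept as true coding regions.
    """
    shared = overlap_length(a, b)
    if shared == 0:
        return "separate"
    shorter = min(a[1] - a[0], b[1] - b[0])
    if shared == shorter:
        return "included"  # the shorter region lies entirely within the longer one
    if shared <= max_fraction * shorter:
        return "tolerated"
    return "excessive"
```

For example, two regions of 1000 and 1050 positions sharing 50 positions fall in the tolerated category, since 50 is no more than 10% of 1000.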

FIG. 9 shows, as a flow chart, an example of the above described method for deciding upon the truth or the falsity of CDSs or transcription units which mutually overlap one another.

Referring to FIG. 9, in S901, the nucleotide sequence and structure data are inputted via the input device. When determining a large number of CDSs from DNA of large size (a nucleotide sequence of DNA of 1000 base pairs or more), the CDSs are selected by repetitively utilizing the above described “method of determining a genetic structure aimed at transcription unit structure” and so on for each of the ORFs.

In S902, for each of the plus strand and the minus strand, the transcription units are numbered and arranged in order from the 5′ terminal to the 3′ terminal, and are stored in the memory.

In S903 through S908, first, each single transcription unit of the plus strand and the minus strand is called out from the memory, and, when a transcription unit Q which is present upon the complementary strand of a transcription unit P is included in the transcription unit P, it is decided that it is a “false transcription unit”; this operation is repeated until the processing is concluded for all the combinations.

Next, in S909, for each of the plus strand and the minus strand, the transcription units which have been determined are numbered and arranged in order from the 5′ terminal to the 3′ terminal, and are stored in the memory. In S910 through S916, first, each single CDS of the plus strand and the minus strand is called out from the memory, and when a CDS-A overlaps with another CDS-B which is present upon the complementary strand, it is decided that the shorter of the transcription units or the CDSs is a “false transcription unit” or a “false CDS”; this operation is repeated until the processing is concluded for all the combinations.

Finally, the results of determination of transcription units which were obtained in S909 and the results of determination of CDSs which were obtained in S916 are outputted via the output device.

As shown in FIG. 9, first, along with performing the determination of the start codon for each ORF from the plus strand, the truth or the falsity as a CDS is decided upon, and, if a CDS appears which has been decided to be a “true CDS”, then it is possible to enhance the accuracy of CDS determination by investigating, by the above described method, whether or not said CDS includes another CDS which is present upon the same strand. It should be understood that, if the above described “method of determining a genetic structure aimed at transcription unit structure” is utilized, it is also possible to determine the structure of the transcription units, as well as that of the CDSs. In the same manner, after having determined the structure of the CDSs and the transcription units from the minus strand as well, by investigating, by the above described method, whether or not each transcription unit includes another transcription unit which is present upon the complementary strand, it is possible to enhance the accuracy of the determination of the transcription units by deciding that the transcription unit which has been included in another transcription unit is a “false transcription unit”.

In recent years, the entire genome sequences of a large number of microbes have been determined, and the number is increasing from year to year. Accordingly, with regard to the determination of the CDSs from the nucleotide sequence of a prokaryote, the necessity of determining the CDSs from the entire genome sequence is increasing. When determining a large number of CDSs from the nucleotide sequence of DNA of large size (DNA of 1000 base pairs or more), it is therefore desirable, for each of the ORFs, to select the CDSs, or to decide upon the truth or the falsity as CDSs, by repeatedly utilizing the above described method of determining a genetic structure. As a method of repeatedly utilizing the method of determining a genetic structure, there is offered a method of, after having arranged the ORFs which are directed upon the same strand by the positions of their stop codons, examining whether their start codons are present in order from the 5′ terminal side ORF, and deciding upon the truth or the falsity as CDSs. An example thereof is shown in FIG. 10. Referring to FIG. 10, in S1001, the nucleotide sequence which is to be analyzed is inputted by the input device.

In S1002 and S1003, the same processing is performed by the CPU as in the above described S701 and S702.

In S1004 and S1005, the CPU performs the same processing as in the above described S703 through S709 and S801 through S811, and, along with storing the results of determination in the memory, for the CDSs which it has decided are true CDSs, proceeds to the following decision processing in order to enhance the accuracy of their determination.

In S1006 through S1011, the CPU investigates the presence or absence of an overlap with the upstream CDS, the length of the overlap, and whether the upstream CDS is included, and, if the amount of overlapping is great, decides the shorter of the CDSs as being false, while if the CDS includes the upstream CDS, it decides that this upstream CDS is false.

This processing is repeated until the processing in S1002 has been completed for all the ORFs, the decision results are stored in the memory, and finally the results of decision are outputted via the output device (S1013).

According to the method shown in FIG. 10, first, along with performing the determination of the start codon for each of the ORFs from the plus strand, its truth or its falsity as a CDS is decided upon, and, if a CDS appears which has been decided to be a true CDS, then it is possible to enhance the accuracy of CDS determination by investigating by the above described method whether or not said CDS includes another CDS which is present upon the same strand. In the same manner, after having determined the CDSs from the minus strand as well, by investigating by the above described method whether or not each CDS or transcription unit includes another CDS or transcription unit which is present upon the complementary strand, it is possible to enhance the accuracy of CDS determination by deciding that the CDS or the transcription unit which has been included in the other CDS or transcription unit is a “false CDS” or a “false transcription unit”.

2. A method of determining a genetic structure utilizing a shadow discrimination function.

Normally, it often happens that a “false CDS” is present which overlaps, upon the complementary strand, with a CDS of a prokaryotic cell, and the existence of these “false CDSs” makes it difficult to enhance the accuracy of CDS determination. This appearance of “false CDSs” upon the complementary strand is often termed “gene shadow”. The determination methods for CDSs which have been developed up till now discriminate these “gene shadows” by relying on the frequency of use of combinations of characteristic nucleotide sequences or of codons in each “true” CDS of the prokaryote; accordingly, with these methods, it is necessary to find out the true CDSs in advance by a different method.

However, as described below, with the present invention, it is envisaged that it is possible to enhance the accuracy of CDS determination, even though no information as to the correct CDSs is available in advance, by investigating the truth or the falsity of the CDSs with a method of calculation for discriminating these “gene shadows”, based upon the ORF information which has been determined by the use of the above described “method of determining a genetic structure aimed at transcription unit structure” or the like. In other words, first, based upon the information for a plurality of CDSs which have been determined by utilizing the above described “method of determining a genetic structure aimed at transcription unit structure” or the like, k combinations of codons are selected for which the frequency of appearance of the codon within CDSs which have been decided to be “true CDSs” is high, and the frequency of appearance in said CDSs of the codon which has the complementary sequence to the 3-base sequence of said codon is low. It is then possible to enhance the accuracy of CDS determination by deciding upon the truth or the falsity of a CDS-A by utilizing a method of calculation (hereinafter termed a “shadow discrimination function”) which makes it possible to compare the “number of times the k types of codon whose frequency of appearance is high appear in the CDS-A” and the “number of times the k types of codon whose frequency of appearance is low appear in the CDS-A” [here, k is an integer greater than or equal to 5 and less than or equal to 20].

In the following, this method will be abbreviated as “a method of determining a genetic structure using a shadow discrimination function”.

The method of determining a genetic structure using a shadow discrimination function of the present invention is not limited to CDSs which have been determined by the above described “method of determining a genetic structure aimed at transcription unit structure”; it can also be applied to CDSs which have been determined by any known method. By applying the “method of determining a genetic structure using a shadow discrimination function” of the present invention, it is possible to enhance the accuracy of determination for CDSs which have been determined by a previously known method of determining coding regions, such as, for example, GenMark [Borodovsky, M & Mcininch, J.: Computers Chem. Vol. 17, 123-133 (1993)], GenMark.hmm [Lukashin, A. V. & Borodovsky, M.: Nucleic Acids Research Vol. 26, 1107-1115 (1998), Besemer, J. & Borodovsky, M.: Nucleic Acids Research Vol. 27, 3911-3920 (1999)], Glimmer [Salzberg, S. et al.: Nucleic Acids Research Vol. 26, p. 544-548 (1998), Delcher, A. L.: Nucleic Acids Research Vol. 27, p. 4636-4641 (1999)], CRITICA [Badger, J. H. & Losen, G. J. et al.: Mol. Biol. Evol. Vol. 16, p. 512-524 (1999)], ORPHELUS [Frishman, M. et al.: Nucleic Acids Research Vol. 26, p. 2941-2947 (1998)], or GenMarkS [Besemer, J., Lomsadze, A. & Borodovsky, M.: Nucleic Acids Research Vol. 29, 2607-2618 (2001)] and the like; nevertheless, it is not limited thereto.

This “method of determining a genetic structure using a shadow discrimination function” can be implemented by causing a computer to perform processing for: (k) among a plurality of T coding regions for the prokaryote which have already been determined and have been inputted via an input means, investigating the types of the codons which are utilized and the number thereof, and, from among them, selecting k types of combination of codons for which “the appearance frequency of some codon is high, and the appearance frequency of a codon which has the complementary sequence of the 3-base sequence of said codon is low”, and storing them in the memory;

(l) from the data of a coding region A, which is assumed to be a coding region, and which has been inputted via an input means, measuring the frequency of appearance of the above described selected codons in said coding region A, and, by comparing together the “number of appearances in said coding region A of the k types of codons whose frequency of appearance is high” and the “number of appearances in said coding region A of the k types of codons whose frequency of appearance is low”, deciding upon the truth or falsity of said coding region A [here k is an integer greater than or equal to 5 and less than or equal to 20];

and displaying the results of the above described decision upon an output device.

It is possible to utilize any calculation method as the “shadow discrimination function”, that is, as the method of calculation which makes it possible to compare the above described “number of times the k types of codon whose frequency of appearance is high appear in the CDS-A” and the “number of times the k types of codon whose frequency of appearance is low appear in the CDS-A”, provided that it includes a comparison between the number of times the former codons appear and the number of times the latter codons appear. As examples of suitable methods for calculating this “shadow discrimination function”, if the number of times the former codons appear is supposed to be “H” and the number of times the latter codons appear is supposed to be “L”, there are offered: H/L, L/H, (H/L+1), (L/H+1), 1/(H/L+1), 1/(L/H+1), 2H/(H+L), and the like. Among these methods of calculation, it is possible to enhance the accuracy of CDS determination by using “the reciprocal of the sum obtained by adding 1 to the ratio of the number of the latter to the number of the former”, that is, 1/(L/H+1), and by deciding that said CDS-A is a “false CDS” when the value of said reciprocal is less than a fixed value [here k is an integer which is greater than or equal to 5 and is less than or equal to 20].
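As a worked note on two of the candidate expressions listed above (an illustrative simplification, assuming H is greater than zero so that the ratio L/H is defined), the reciprocal form and the form 2H/(H+L) differ only by a factor of 2:

$$\frac{1}{L/H + 1} = \frac{H}{H + L}, \qquad \frac{2H}{H + L} = 2 \times \frac{1}{L/H + 1}$$

The reciprocal form therefore lies between 0 and 1 and equals 1/2 when H = L, while 2H/(H+L) lies between 0 and 2 and equals 1 when H = L.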

In the following, a concrete example of this “shadow discrimination function” method will be explained.

First, when the 64 types of codons:

TTA, CTA, TCA, TTT, TTC, TTG, TCT, TCC, TCG, TAT, TAC, TGT, TGC, TGG, CTT, CTC, CTG, CCT, CCC, CCG, CAT, CAC, CGT, CGC, ATT, ATC, ACT, ACC, AAC, AGC, GTC, GCC, TAA, TAG, TGA, AAA, GAA, CAA, AGA, GGA, CGA, ATA, GTA, ACA, GCA, CCA, AAG, GAG, CAG, AGG, GGG, CGG, ATG, GTG, ACG, GCG, AAT, GAT, AGT, GGT, GTT, GCT, GAC, and GGC

have been lined up in order, the 3-base sequence of the i-th codon (where i is less than or equal to 32) has the complementary sequence to the 3-base sequence of the (i+32)-th codon (for example, the first codon TTA and the 33rd codon TAA are reverse complements of one another). When the frequency at which the 3-base sequence of the i-th codon appears in true CDSs is high, and the frequency at which the 3-base sequence which corresponds to the (i+32)-th codon appears in true CDSs is low, then the frequency with which the 3-base sequence of the i-th codon appears in the nucleotide sequence of the opposite strand becomes low, and the frequency at which the 3-base sequence which corresponds to the (i+32)-th codon appears in the nucleotide sequence of the opposite strand becomes high. Due to this fact, when the codons which appear in true CDSs have been analyzed, the difference between the frequency of appearance of the 3-base sequence of the i-th codon and the frequency of appearance of the 3-base sequence of the (i+32)-th codon is obtained, and, the greater this difference is, the likelier it becomes that the i-th codon appears in a true CDS, and furthermore the likelier it becomes that the (i+32)-th codon appears in the nucleotide sequence of the opposite strand of a true CDS.
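The pairing property of the codon table can be checked mechanically. The short sketch below confirms that each of the first 32 codons in the list above is the reverse complement of the codon 32 places later; the names `CODON_TABLE`, `reverse_complement` and `table_is_paired` are illustrative and do not come from the specification.

```python
# The 64 codons in the order given in the text (positions 1..64).
CODON_TABLE = (
    "TTA CTA TCA TTT TTC TTG TCT TCC TCG TAT TAC TGT TGC TGG CTT CTC "
    "CTG CCT CCC CCG CAT CAC CGT CGC ATT ATC ACT ACC AAC AGC GTC GCC "
    "TAA TAG TGA AAA GAA CAA AGA GGA CGA ATA GTA ACA GCA CCA AAG GAG "
    "CAG AGG GGG CGG ATG GTG ACG GCG AAT GAT AGT GGT GTT GCT GAC GGC"
).split()

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def table_is_paired(table) -> bool:
    """True if the table holds 64 distinct codons and codon i pairs with codon i+32."""
    if len(table) != 64 or len(set(table)) != 64:
        return False
    return all(reverse_complement(table[i]) == table[i + 32] for i in range(32))


print(table_is_paired(CODON_TABLE))
```

Because a 3-base sequence can never be its own reverse complement, the 64 codons always split into exactly 32 such pairs, which is what makes this arrangement possible.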

In other words, T CDSs are selected as true CDSs, and, when the number of times that the j-th codon appears in the t-th CDS which has been selected is termed C^(t)_(j), then the above described differences y_(i) and y_(i+32) are given by the following equations:

$$y_{i} = \left( \sum_{t=1}^{T} C_{i}^{t} - \sum_{t=1}^{T} C_{i+32}^{t} \right) \Big/ \sum_{t=1}^{T} \sum_{j=1}^{64} C_{j}^{t} \tag{2}$$

$$y_{i+32} = \left( \sum_{t=1}^{T} C_{i+32}^{t} - \sum_{t=1}^{T} C_{i}^{t} \right) \Big/ \sum_{t=1}^{T} \sum_{j=1}^{64} C_{j}^{t} \tag{3}$$
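As a brief worked consequence (a note added for illustration, assuming that equations (2) and (3) are read with every outer sum running over t = 1 to T and with the subtracted term in (3) being the count of the i-th codon), the two differences are equal in magnitude and opposite in sign, so the 64 values sum to zero:

$$y_{i+32} = -\,y_{i} \quad (1 \le i \le 32), \qquad \sum_{j=1}^{64} y_{j} = 0$$

Under this reading, sorting the 64 codons by the value of y places each strongly preferred codon near the head of the list and its complementary partner correspondingly near the tail.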

Next, the values of y_(i) and y_(i+32) are calculated (where i is a positive integer less than or equal to 32), and the above described 64 types of codons are rearranged in descending order of this value. When the leading k codons in order of magnitude of the value of y_(i) or y_(i+32) are chosen, the value of the shadow discrimination function of the n-th CDS (hereinafter abbreviated as Sd) is given by the following equation, in which the index i runs over the rearranged codons:

$$Sd_{n} = 2 \times \sum_{i=1}^{k} C_{i}^{n} \Big/ \left( \sum_{i=1}^{k} C_{i}^{n} + \sum_{i=65-k}^{64} C_{i}^{n} \right) \tag{8}$$

(k is an integer from 5 to 20)

Here, when

$$\sum_{i=1}^{k} C_{i}^{n} + \sum_{i=65-k}^{64} C_{i}^{n}$$

is zero, the value of Sd_(n) is taken as being 1.
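The calculation described above (the differences y, the choice of the k leading and k trailing codons, and the value Sd) can be sketched as follows. This is a minimal illustration under stated assumptions rather than the inventors' program: all function names are hypothetical, CDSs are assumed to be supplied as in-frame DNA strings, codons containing other characters are simply ignored, and ties in y are broken by table order.

```python
from collections import Counter

CODON_TABLE = (
    "TTA CTA TCA TTT TTC TTG TCT TCC TCG TAT TAC TGT TGC TGG CTT CTC "
    "CTG CCT CCC CCG CAT CAC CGT CGC ATT ATC ACT ACC AAC AGC GTC GCC "
    "TAA TAG TGA AAA GAA CAA AGA GGA CGA ATA GTA ACA GCA CCA AAG GAG "
    "CAG AGG GGG CGG ATG GTG ACG GCG AAT GAT AGT GGT GTT GCT GAC GGC"
).split()


def reverse_complement(seq):
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def codon_counts(cds):
    """Count the in-frame codons of one CDS."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    return Counter(cds[p:p + 3] for p in range(0, usable, 3))


def asymmetry_scores(true_cdss):
    """y for every codon: (own count - partner count) / total codon count."""
    total = Counter()
    for cds in true_cdss:
        total.update(codon_counts(cds))
    grand = sum(total[c] for c in CODON_TABLE)
    if grand == 0:
        raise ValueError("no codons found in the supplied CDSs")
    y = {}
    for i in range(32):
        a, b = CODON_TABLE[i], CODON_TABLE[i + 32]
        diff = (total[a] - total[b]) / grand
        y[a], y[b] = diff, -diff
    return y


def select_codons(y, k):
    """Return the k codons with the largest y and the k with the smallest y."""
    ranked = sorted(CODON_TABLE, key=lambda c: y[c], reverse=True)
    return ranked[:k], ranked[-k:]


def shadow_score(cds, high, low):
    """Sd = 2H / (H + L), taken as 1 when H + L is zero."""
    counts = codon_counts(cds)
    h = sum(counts[c] for c in high)
    l = sum(counts[c] for c in low)
    return 1.0 if h + l == 0 else 2.0 * h / (h + l)


# Toy data (invented for illustration only).
true_cds = "ATG" + "GCC" * 10 + "CTG" * 8 + "AAC" * 6 + "ATC" * 5 + "CGC" * 4 + "TAA"
high, low = select_codons(asymmetry_scores([true_cds]), k=5)
print(shadow_score(true_cds, high, low))
print(shadow_score(reverse_complement(true_cds), high, low))
```

On this toy input the CDS itself scores at the top of the range and its reverse complement, which plays the role of the gene shadow, scores at the bottom, so a threshold near 1 separates them.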

Based upon the value of this shadow discrimination function Sd, it is possible to decide, when the value of Sd for some ORF exceeds a threshold value (for example 1.0) which is specified in advance, that it is a “true” CDS. Furthermore, when two CDSs overlap, or are in an inclusion relationship, it is possible to decide upon the truth or the falsity of the CDSs by calculating the Sd value of each CDS, and by comparing their values. When deciding upon the truth or the falsity, not of a CDS, but of a transcription unit, after having computed the Sd value based upon all the codons of the CDSs which make up the transcription unit, it is possible to decide upon the truth or the falsity of the transcription unit from this Sd value.

In the above described method, as shown in FIG. 14:

(m) constructing a codon table by arranging the 64 types of codons so that the 3-base sequence of the i-th codon has the complementary sequence to the nucleotide sequence of the (i+32)-th codon, and storing it in the memory (S1401);

(n) inputting the nucleotide sequences of T coding regions of a prokaryote which have already been determined, and, when the number of times that the j-th codon appears in the t-th coding region is taken as C^(t)_(j), obtaining y_(i) from the equation (2) below and y_(i+32) from the equation (3) following it:

$$y_{i} = \left( \sum_{t=1}^{T} C_{i}^{t} - \sum_{t=1}^{T} C_{i+32}^{t} \right) \Big/ \sum_{t=1}^{T} \sum_{j=1}^{64} C_{j}^{t} \tag{2}$$

$$y_{i+32} = \left( \sum_{t=1}^{T} C_{i+32}^{t} - \sum_{t=1}^{T} C_{i}^{t} \right) \Big/ \sum_{t=1}^{T} \sum_{j=1}^{64} C_{j}^{t} \tag{3}$$

(S1402);

(o) calling out from the memory the codon table which was obtained in the step (m), setting up a correspondence between the y_(i) and y_(i+32) and the codons in the table, and, after having rearranged the codons in the table in the order of magnitude of the y_(i) and the y_(i+32) (S1403), choosing the k leading codons for which the value of y_(i) or of y_(i+32) is large, and obtaining the value of Sd_(A) for a coding region A by the following equation (4):

$$Sd_{A} = 2 \times \sum_{i=1}^{k} C_{i}^{A} \Big/ \left( \sum_{i=1}^{k} C_{i}^{A} + \sum_{i=65-k}^{64} C_{i}^{A} \right) \tag{4}$$

[here, when

$$\sum_{i=1}^{k} C_{i}^{A} + \sum_{i=65-k}^{64} C_{i}^{A}$$

is zero, the value of Sd_(A) is taken as being 1] (S1404);

(p) when the value of Sd_(A) for the coding region A which has been obtained by the above described processing is greater than a threshold value S₁, then said coding region is taken as a true coding region, while, when said value of Sd_(A) is less than the threshold value S₁, it is taken as a false coding region (S1405) [here T is an integer greater than or equal to 2, i is a positive integer less than or equal to 32, j is a positive integer less than or equal to 64, t is a positive integer less than or equal to T, k is an integer from 5 to 20, and S₁ is a value from 0.8 to 1.8].

An implementation is possible in which a computer is caused to execute processing to output the above described decision results via an output device.

As an example of taking advantage of this shadow discrimination function Sd for deciding upon the truth or the falsity of two CDSs or transcription units, after having determined long CDSs (the length of the polypeptide for which the CDSs code is greater than or equal to L amino acids, where L is a positive integer greater than or equal to 100) by the above described “method of determining a genetic structure aimed at transcription unit structure”, the value of the above described shadow discrimination function Sd is obtained from the sequence information of these CDSs, and it is possible to enhance the accuracy of CDS determination by deciding upon the truth or the falsity of the CDSs by using said Sd value as a threshold value. In this method, as a method of determining the CDSs, it will be acceptable to utilize some method other than the above described “method of determining a genetic structure aimed at transcription unit structure”.

When a CDS or a transcription unit upon the plus strand and a CDS or a transcription unit upon the minus strand overlap, or are in an inclusion relationship, if the lengths of the two are greatly different, it will be acceptable to decide that the one of these CDSs or transcription units whose length is the shorter is a “false CDS” or a “false transcription unit”. However, if their lengths do not differ greatly, then it is possible to decide upon the truth or the falsity of the CDSs or transcription units by the above described method, and in particular by comparing the values of the shadow discrimination function for the individual CDSs or transcription units. More desirably, it is possible to compare together the length L_(A) (in base pairs) of the CDS-A or of the transcription unit A and the length L_(B) of the CDS-B or of the transcription unit B, and to decide that the CDS-B or the transcription unit B is a “false CDS” or a “false transcription unit” when L_(B) is less than or equal to T_(P) % of L_(A) [here T_(P) is a positive integer from 30 to 95]. Furthermore, when L_(B) exceeds T_(P) % of L_(A), it is possible to decide upon the truth or the falsity of the CDS-A or of the transcription unit A, or of the CDS-B or of the transcription unit B, by the above described method which uses the “shadow discrimination function”.
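The two-stage decision described above, length ratio first and shadow discrimination function second, can be sketched as below. The function name `judge_pair`, the default `tp_percent=50`, and the rule that equal Sd values drop the shorter unit are assumptions made for this illustration; the text fixes only the range of T_(P) and the general procedure.

```python
def judge_pair(len_a, len_b, sd_a, sd_b, tp_percent=50):
    """Return "A" or "B": the label of the unit judged false.

    len_a, len_b: lengths in base pairs of the two overlapping units.
    sd_a, sd_b:   their shadow discrimination function values.
    tp_percent:   the percentage T_P (an integer from 30 to 95).
    """
    if not 30 <= tp_percent <= 95:
        raise ValueError("T_P must be between 30 and 95")
    if len_a < len_b:
        # Relabel so that the longer unit is treated as A, then map back.
        loser = judge_pair(len_b, len_a, sd_b, sd_a, tp_percent)
        return "B" if loser == "A" else "A"
    # Stage 1: a much shorter unit is judged false on length alone.
    if len_b * 100 <= tp_percent * len_a:
        return "B"
    # Stage 2: comparable lengths, so the smaller Sd value is judged false.
    return "A" if sd_a < sd_b else "B"


print(judge_pair(900, 300, sd_a=0.5, sd_b=1.9))  # length rule applies
print(judge_pair(900, 600, sd_a=0.5, sd_b=1.9))  # Sd comparison applies
```

Integer arithmetic is used for the percentage test so that the boundary case, where L_(B) is exactly T_(P) % of L_(A), is not affected by floating-point rounding.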

This method of deciding upon the truth or the falsity of a CDS-A or a transcription unit A and a CDS-B or a transcription unit B is not limited to CDSs which have been determined by the above described “method of determining a genetic structure aimed at transcription unit structure”; it can also be applied to CDSs or transcription units which have been determined by any known method. By applying the above described method of deciding upon the truth or the falsity of a CDS-A or a transcription unit A and a CDS-B or a transcription unit B to CDSs or transcription units which have been determined by a previously known method, such as, for example, GenMark [Borodovsky, M & Mcininch, J.: Computers Chem. Vol. 17, 123-133 (1993)], GenMark.hmm [Lukashin, A. V. & Borodovsky, M.: Nucleic Acids Research Vol. 26, 1107-1115 (1998), Besemer, J. & Borodovsky, M.: Nucleic Acids Research Vol. 27, 3911-3920 (1999)], Glimmer [Salzberg, S. et al.: Nucleic Acids Research Vol. 26, p. 544-548 (1998), Delcher, A. L.: Nucleic Acids Research Vol. 27, p. 4636-4641 (1999)], CRITICA [Badger, J. H. & Losen, G. J. et al.: Mol. Biol. Evol. Vol. 16, p. 512-524 (1999)], ORPHELUS [Frishman, M. et al.: Nucleic Acids Research Vol. 26, p. 2941-2947 (1998)], or GenMarkS [Besemer, J., Lomsadze, A. & Borodovsky, M.: Nucleic Acids Research Vol. 29, 2607-2618 (2001)] and the like, it is possible to determine their truth or falsity; nevertheless, it is not limited thereto.

In the method shown in FIG. 10, in the processing for deciding upon the truth or the falsity of CDSs or transcription units which mutually overlap with one another, it is possible to perform the decision by utilizing the above described “shadow discrimination function”.

As described above, in recent years, the requirement has increased for determining the CDSs from the entire genome sequence of a microbe at high speed and moreover at high accuracy. In the following, an example of a method for enhancing the accuracy of CDS determination by taking advantage of the value of the above described shadow discrimination function Sd when determining the CDSs from the entire genome sequence of a prokaryote will be described in detail.

When determining a large number of CDSs from the nucleotide sequence of DNA of large size (DNA of 1000 base pairs or more), for each of the ORFs, the CDSs are selected by repeatedly utilizing the above described “method of determining a genetic structure aimed at transcription unit structure” or the like.

As shown in FIG. 9, first, along with performing determination of the start codon for each ORF from the plus strand, its truth or its falsity as a CDS is decided upon, and, if a CDS has appeared for which it has been decided that it is a “true CDS”, then it is possible to enhance the accuracy of CDS determination by investigating, with the above described method, whether or not said CDS includes another CDS upon the same strand. It should be understood that, if the above described “method of determining a genetic structure aimed at transcription unit structure” is employed, it is possible to determine, not only the structure of the CDSs, but also the structure of the transcription units. After having determined the structure of the CDSs and of the transcription units from the minus strand in the same manner, it is possible to enhance the accuracy of transcription unit and CDS determination by investigating by the above described method whether or not each transcription unit includes another transcription unit upon the complementary strand, and by deciding that a transcription unit which is included in another transcription unit is a “false transcription unit”. Based upon the information for the CDSs which have been determined in this manner, k combinations (where k is an integer from 5 to 20) of two types of codons are selected: codons whose frequency of appearance within CDSs which have been decided to be “true CDSs” is high, and the codons having the complementary sequence to the 3-base sequence of said codons, whose frequency of appearance in said CDSs is low. Based upon the “k types of codons of which the frequency of appearance is high” and the “k types of codons of which the frequency of appearance is low” which have been selected in this manner, it is possible to obtain the value of the shadow discrimination function Sd for each of the CDSs by the above described method.

Next, the structure of the CDSs and the transcription units is determined for a second time from the plus strand and the minus strand, using the above described “method of determining a genetic structure aimed at transcription unit structure”.

As shown in FIG. 11, by taking advantage of the value of the above described shadow discrimination function Sd for the U_(P1) transcription units which have been selected from the plus strand and the U_(M1) transcription units which have been selected from the minus strand, it is possible to decide upon the truth or the falsity of the transcription units, and to enhance the accuracy of determination of the transcription units and the CDSs.

If the CDSs have been determined by a method other than the above described “method of determining a genetic structure aimed at transcription unit structure”, when the distance between two adjacent CDSs, in other words, the distance between the stop codon of the upstream CDS and the start codon of the downstream CDS, is D_(C) base pairs (where D_(C) is an integer from 30 to 120), it is possible to create a transcription unit by deciding that a polycistron is formed.

For a transcription unit which has been created in this manner, it is possible to decide upon the truth or the falsity of the transcription unit by taking advantage of the value Sd of the above described shadow discrimination function, thus making it possible to enhance the accuracy of CDS and transcription unit determination.
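One way to picture this grouping of adjacent CDSs into a transcription unit is the sketch below. It rests on an interpretation that the text does not state outright: two adjacent CDSs on the same strand are assumed to be joined when the gap between them is at most D_(C) base pairs. The function name `group_transcription_units` and the default `d_c=60` are likewise hypothetical.

```python
def group_transcription_units(cdss, d_c=60):
    """Group CDSs on one strand into candidate transcription units.

    cdss: iterable of (start, end) pairs in reading direction, where end is
          the position just past the stop codon.
    d_c:  assumed maximum gap in base pairs (an integer from 30 to 120).
    """
    if not 30 <= d_c <= 120:
        raise ValueError("D_C must be between 30 and 120")
    units = []
    for start, end in sorted(cdss):
        # Gap between the upstream stop codon and this start codon.
        if units and start - units[-1][-1][1] <= d_c:
            units[-1].append((start, end))   # extend the current polycistron
        else:
            units.append([(start, end)])     # begin a new transcription unit
    return units


for unit in group_transcription_units([(0, 300), (320, 800), (1000, 1500), (1550, 1700)]):
    print(unit)
```

With the default cutoff, the gaps of 20 and 50 bases join their neighbours, while the gap of 200 bases starts a second unit.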

Referring to FIG. 11, in S1101, the data for the nucleotide sequence which is to be analyzed is inputted via the input device.

In S1102, the transcription units upon the plus strand and the minus strand are selected, and, based upon the positions of their stop codons, they are lined up from the 5′ terminal to the 3′ terminal and are stored in the memory. The selecting and lining up of the transcription units may be performed by any of the above described methods.

Next, the CPU calls out the transcription units of the plus strand and the minus strand which have been stored in S1102 one at a time in order from the 5′ terminal, and investigates whether or not they are mutually included (S1103, S1106), and, if they are thus included, compares the lengths of the two transcription units (S1104, S1107), and, if one of them is less than P₁ % of the other, calculates the shadow discrimination function for each of them, compares them (S1105, S1108), and takes the one whose shadow discrimination function is the smaller as being false.

This operation is performed for the transcription units which have been stored in the memory in S1102, and as a result, the transcription units of the plus strand and the minus strand which have not been taken as false are selected, and, taking the positions of their stop codons as the standard, they are lined up from the 5′ terminal to the 3′ terminal and are stored in the memory (S1109).

Next, the CPU calls out the transcription units of the plus strand and the minus strand which have been stored in the memory in S1109 one at a time in order from the 5′ terminal, and investigates whether or not their 5′ terminal sides and 3′ terminal sides overlap (S1110, S1113), and, if they do thus overlap, compares the lengths of the two transcription units (S1111, S1114), and, if one of them is less than P₁ % of the other, calculates the shadow discrimination function for each of them, compares them (S1112, S1115), and takes the one whose shadow discrimination function is the smaller as being false.

This operation is performed for the transcription units which have been stored in the memory in S1109, and as a result, the transcription units of the plus strand and the minus strand which have not been taken as false are selected, and the results thereof are outputted via the output device (S1116).

3. A method of determining a genetic structure aimed at the GC content of the bases in the codons

When determining CDSs from a nucleotide sequence of a prokaryote in which the GC content is high, it is difficult to determine the CDSs at high accuracy, since a large number of long ORFs are present. With the present invention, it has become apparent that, as a characteristic of the CDSs of a prokaryote whose GC content exceeds 50%, the GC content of the first and the third bases of the codons within said CDSs is high, and a “method of determining a genetic structure aimed at the GC content of the bases in the codons” has been conceived of which takes advantage of this characteristic for enhancing the accuracy of CDS determination. In other words, this is a method which, after having calculated said content using a calculation expression which takes into account the contents of the first and the third G residue and C residue of the codons within said CDS, enhances the accuracy of determination of the CDSs of the prokaryote by deciding that said CDS is a “false CDS” when said content is less than a fixed value. Although it is possible to apply this method when the value of the GC content of the nucleotide sequence is greater than or equal to 50%, it is more desirable for it to be greater than or equal to 60%. It is possible to utilize any type of calculation expression as the calculation expression for this method, provided that it is a calculation expression which yields a content of the first and the third G residue and C residue of the codons within said CDS; and, as examples of suitable such calculation expressions, it is possible to utilize an “expression obtained by dividing the total of the first and the third G residues and C residues of all the codons within said CDS by the number of bases of all the codons”, an “expression obtained by dividing said total by the total of G residues and C residues of all the codons”, or an “expression obtained by dividing said total by the total of the second G residues and C residues of all the codons”. More desirably, a calculation expression is utilized whose value is not easily influenced by the GC content of the nucleotide sequence. As an “expression obtained by dividing the total of the first and the third G residues and C residues of all the codons within said CDS by the total of G residues and C residues of all the codons”, for the i-th CDS, it is possible to utilize the value of GC_(i) (hereinafter abbreviated as the “GC function” or GC) which is computed by the expression (5) below:

[Here, when the r-th base (r=1, 2, 3) of the n-th codon of the i-th CDS is b (b=1, 2, 3, 4), then

$$GC_{i} = \left( y^{i(1)} + y^{i(3)} \right) \Big/ \sum_{r=1}^{3} y^{i(r)} \tag{5}$$

where

$$y^{i(r)} = \sum_{n=1}^{N_{i}} \sum_{b=1}^{4} x_{n(b)}^{i(r)}, \qquad x_{n(b)}^{i(r)} = \begin{cases} 1 & (b = 1 \text{ or } 2) \\ 0 & (b = 3 \text{ or } 4) \end{cases}$$

and, at this time, when the r-th base of the n-th codon of the i-th CDS is G, C, A, or T, then b is respectively 1, 2, 3, or 4. It should be understood that i and n are positive integers, while N_(i) is the total number of codons in the i-th CDS (excluding its stop codon).]

When the value of the “GC function” of a CDS whose GC content is 50% is obtained, on average it is ⅔ (approximately 0.667). In other words, the value of the “GC function” is a value scattered around ⅔. Here, it will be understood that the value of the “GC function” of a CDS whose GC content exceeds 60% exceeds ⅔ for almost all CDSs. Accordingly, when discriminating the truth or the falsity of a CDS using the above described “GC function”, it is desirable to utilize a numerical value within the range of 0.6 to 0.75 as the fixed value.
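The “GC function” and the comparison against a fixed value can be sketched as follows. This is an illustrative reading of expression (5), not the original program: the function names, the default `fixed_value=0.7` (chosen from within the stated range of 0.6 to 0.75), the removal of a trailing stop codon under the standard genetic code, and the decision to return `None` for a sequence with no G or C at all are assumptions made here.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code (assumed)


def gc_function(cds):
    """(G+C at 1st and 3rd codon positions) / (G+C at all three positions)."""
    cds = cds.upper()
    cds = cds[: len(cds) - len(cds) % 3]   # keep whole codons only
    if cds[-3:] in STOP_CODONS:
        cds = cds[:-3]                     # N_i excludes the stop codon
    gc_at = [0, 0, 0]                      # G/C counts per codon position
    for pos, base in enumerate(cds):
        if base in "GC":
            gc_at[pos % 3] += 1
    total = sum(gc_at)
    if total == 0:
        return None                        # undefined: no G or C present
    return (gc_at[0] + gc_at[2]) / total


def is_false_cds(cds, fixed_value=0.7):
    """Judge a candidate false when its GC function is below the fixed value."""
    value = gc_function(cds)
    return value is not None and value < fixed_value


print(gc_function("GCGCGC"))        # G/C spread evenly over the three positions
print(gc_function("ATGGCCGGCTAA"))  # G/C biased toward positions 1 and 3
```

When G and C residues are spread evenly over the three codon positions, two of the three positions contribute to the numerator, which is why the value sits at ⅔ in that case.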

The “method of determining a genetic structure aimed at the GC content of the bases in the codons” of the present invention is not limited to CDSs which have been determined by the above described “method of determining a genetic structure aimed at transcription unit structure”; it can also be applied to CDSs or transcription units which have been determined by any known method. By applying the “method of determining a genetic structure aimed at the GC content of the bases in the codons” of the present invention to CDSs which have been determined by a previously known method of determining coding regions, such as, for example, GenMark [Borodovsky, M & Mcininch, J.: Computers Chem. Vol. 17, 123-133 (1993)], GenMark.hmm [Lukashin, A. V. & Borodovsky, M.: Nucleic Acids Research Vol. 26, 1107-1115 (1998), Besemer, J. & Borodovsky, M.: Nucleic Acids Research Vol. 27, 3911-3920 (1999)], Glimmer [Salzberg, S. et al.: Nucleic Acids Research Vol. 26, p. 544-548 (1998), Delcher, A. L.: Nucleic Acids Research Vol. 27, p. 4636-4641 (1999)], CRITICA [Badger, J. H. & Losen, G. J. et al.: Mol. Biol. Evol. Vol. 16, p. 512-524 (1999)], ORPHELUS [Frishman, M. et al.: Nucleic Acids Research Vol. 26, p. 2941-2947 (1998)], or GenMarkS [Besemer, J., Lomsadze, A. & Borodovsky, M.: Nucleic Acids Research Vol. 29, 2607-2618 (2001)] and the like, it is possible to enhance the accuracy of their determination; nevertheless, it is not limited thereto.

The above described “method of determining a genetic structure aimed at the GC content of the bases in the codons”, as shown in FIG. 15, can be implemented by causing a computer to perform the processing of: from the data (S1501) of a coding region which has already been selected of a prokaryote for which the GC content exceeds 50% and which has been inputted via an input device, calculating the content of the first and the third G residue and C residue of the codons within the above described coding region by utilizing a predetermined calculation expression (S1502), and, when the content which has been obtained by this calculation is less than a fixed value, deciding that said coding region is a “false coding region” (S1503), and outputting the results of this decision via an output device.

When determining a CDS, it may happen that the accuracy of CDS determination is deteriorated by selecting a false start codon. When a codon which is upstream of the true start codon of a CDS has been selected as its start codon, a false CDS comes to be formed which is linked to the 5′ terminal of the true CDS. Thus, after having computed the content of the first and the third G residue and C residue of the codons of the 5′ terminal side region of said CDS by utilizing a calculation expression which yields said content, when said content is less than a fixed value, the possibility is high that the 5′ terminal side region is not a portion of the true CDS. In such a case, it is possible to enhance the determination accuracy of the start codon by again performing the search for the translation start codon in the downstream direction from this start codon. It is possible to utilize any method as a method for again searching for this start codon. For example, there may be offered a method of determining the start codon by taking the fact that a ribosome binding sequence is present upstream of the start codon as an indicator, or the method of determining the start codon which is used during the “method of determining a genetic structure aimed at transcription unit structure” of the present invention. As a calculation expression which yields the content of the first and the third G residue and C residue of the codons of the 5′ terminal side region of said CDS, it is possible to utilize the above described calculation expressions. Furthermore, as the 5′ terminal side region, a region of length from 30 to 300 base pairs is desirable.

The above described method can be implemented by causing a computer to perform the processing of: from the data of a coding region which has already been selected of a prokaryote for which the GC content exceeds 50% and which has been inputted via an input device, calculating the content of the first and the third G residue and C residue of the codons of the region of the 5′ terminal side of the above described coding region by utilizing a predetermined calculation expression; and, when the content which has been calculated is less than a fixed value, deciding that the translation start codon of said coding region is a “false translation start codon”, and, along with outputting the results of this decision via an output device, also calling out the nucleotide sequence data of the above described coding region which has been inputted via the input device, and again searching for a translation start codon which is present downstream of said false translation start codon.
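The start codon re-check described above can be sketched as follows. Several points are assumptions made only for this illustration: the 5′ terminal side region is passed as `region` (the text gives a desirable range of 30 to 300 base pairs), the fixed value defaults to 0.7, a region containing no G or C is scored as 0, and the renewed search simply takes the next in-frame ATG, GTG or TTG as a stand-in for whichever search method is actually used (for example one based on a ribosome binding sequence).

```python
def gc13_fraction(seq):
    """(G+C at codon positions 1 and 3) / (G+C at all positions); 0.0 if none."""
    gc_at = [0, 0, 0]
    for pos, base in enumerate(seq[: len(seq) - len(seq) % 3]):
        if base in "GC":
            gc_at[pos % 3] += 1
    total = sum(gc_at)
    return (gc_at[0] + gc_at[2]) / total if total else 0.0


def recheck_start(cds, region=90, fixed_value=0.7,
                  starts=("ATG", "GTG", "TTG")):
    """Return the offset of the start codon to use.

    0     -> the current start codon is kept
    n > 0 -> the current start is judged false; a candidate lies n bases downstream
    None  -> the current start is judged false and no downstream candidate exists
    """
    cds = cds.upper()
    if gc13_fraction(cds[:region]) >= fixed_value:
        return 0
    # Assumed search rule: first in-frame candidate start codon downstream.
    for pos in range(3, len(cds) - 2, 3):
        if cds[pos:pos + 3] in starts:
            return pos
    return None


# Toy CDS: a low-scoring 5' extension followed by the presumed real start.
toy = "ATG" + "AGA" * 9 + "ATG" + "GCC" * 20 + "TAA"
print(recheck_start(toy, region=30))
```

In the toy sequence the first 30 bases carry almost no G or C at the first and third codon positions, so the original start codon is rejected and the in-frame ATG at offset 30 is proposed instead.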

The above described “method of determining a genetic structure aimed at the GC content of the bases in the codons” can be utilized in combination with a different method of CDS determination. Desirably, the accuracy of CDS determination for a prokaryote is enhanced by utilizing it together with the “method of determining a genetic structure aimed at transcription unit structure” and the “method of determining a genetic structure using a shadow discrimination function” of the present invention.

Although it may also happen, when determining the CDSs of a prokaryotic cell, that the accuracy of CDS determination is not enhanced by combining two per se known methods of determining a genetic structure, in many cases it is possible to enhance the accuracy of CDS determination further by combining a “method of deciding upon the truth or the falsity of a CDS by utilizing the coding potential”, which is computed by using the appearance frequency and the codon use frequency of the nucleotide sequences of true CDSs and a stochastic process model such as a Markov model or a hidden Markov model or the like, with any of the “method of determining a genetic structure aimed at transcription unit structure”, the “method of determining a genetic structure using a shadow discrimination function”, or the “method of determining a genetic structure aimed at the GC content of the bases in the codons” of the present invention.

As an example of said “method of deciding upon the truth or the falsity of a CDS by utilizing the coding potential”, it is possible to suggest a method of, based upon the nucleotide sequences of T CDSs of the prokaryote which have already been determined, deciding upon the truth or the falsity of a CDS-A by utilizing a calculation expression which is capable of comparing the “number of times m types of codons whose frequency of appearance in the T CDSs is high appear in the CDS-A” and the “number of times m types of codons whose frequency of appearance in the T CDSs is low appear in the CDS-A”. However, it would also be possible to utilize any calculation expression, provided that it is a calculation expression which gives a coding potential [here, T is an integer greater than or equal to 2, and m is an integer greater than or equal to 5 and less than or equal to 20].

As an example of a calculation expression which is capable of comparing the “number of times m types of codons whose frequency of appearance is high appear in the CDS-A” and the “number of times m types of codons whose frequency of appearance is low appear in the CDS-A”, it is possible to suggest “the reciprocal of the result obtained by adding 1 to the ratio of the latter number of times to the former number of times”. A “Cd value”, which is an example based upon this reciprocal, is shown below:

$\begin{matrix} \mathrm{Cd}_{A} = 2 \times \dfrac{\sum_{i=1}^{m} C_{i}^{A}}{\sum_{i=1}^{m} C_{i}^{A} + \sum_{i=62-m}^{61} C_{i}^{A}} & (7) \end{matrix}$

Here, when $\sum_{i=1}^{m} C_{i}^{A} + \sum_{i=62-m}^{61} C_{i}^{A}$ is zero, the value of Cd_(A) is taken as being 1.

An example of the above described “method of deciding upon the truth or the falsity of a CDS by utilizing the coding potential” is:

(m) A codon table is created in which the 64 types of codons are arranged so that the sequence of the 3 bases of the i-th codon is complementary to the nucleotide sequence of the (i+32)-th codon, and it is stored in the memory;

(s) When the number of times the i-th codon appears in the t-th coding region is taken as $C_{i}^{t}$, then y_(i) is obtained by the following equation (6): $\begin{matrix} y_{i} = \dfrac{\sum_{t=1}^{T} C_{i}^{t}}{\sum_{t=1}^{T} \sum_{j=1}^{64} C_{j}^{t}} & (6) \end{matrix}$

(t) The codon table is called out from the memory, and, after the 64 types of codons have been rearranged in descending order of y_(i), the “m leading codons for which the value of y_(i) is large” and the “m trailing codons for which the value of y_(i) is small, excluding the translation stop codons” are chosen, and the value of Cd_(A) for the coding region A is obtained from the following equation (7): $\begin{matrix} \mathrm{Cd}_{A} = 2 \times \dfrac{\sum_{i=1}^{m} C_{i}^{A}}{\sum_{i=1}^{m} C_{i}^{A} + \sum_{i=62-m}^{61} C_{i}^{A}} & (7) \end{matrix}$

Here, when $\sum_{i=1}^{m} C_{i}^{A} + \sum_{i=62-m}^{61} C_{i}^{A}$ is zero, the value of Cd_(A) is taken as being 1;

(u) When the value of Cd_(A) for the coding region A which has been computed in the step (t) is greater than or equal to a threshold value CV, then said coding region is taken as being a true coding region, while when said value of Cd_(A) is less than the threshold value CV then it is taken as a false coding region, and the results of this decision are outputted via the output device [here, T is an integer greater than or equal to 2, i is a positive integer less than or equal to 64, j is a positive integer less than or equal to 64, t is a positive integer less than or equal to T, m is an integer from 5 through 20, and CV is a value from 0.8 to 1.8].
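Steps (s) through (u) may be sketched as follows. This is an illustrative sketch under stated assumptions, not the program of the examples: the function names are hypothetical, the three stop codons are simply excluded before ranking (which corresponds to the indices 62−m to 61 of equation (7) only if the stop codons rank last), ties in y_(i) are broken arbitrarily, and the default values m=10 and CV=1.0 are placeholders chosen from within the ranges given above.

```python
from collections import Counter
from itertools import product

CODONS = ["".join(p) for p in product("ACGT", repeat=3)]  # all 64 codons
STOPS = {"TAA", "TAG", "TGA"}


def codon_counts(cds):
    """Count in-frame codons of a coding sequence (partial last codon ignored)."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    return Counter(cds[i:i + 3] for i in range(0, usable, 3))


def codon_frequencies(known_cdss):
    """y_i of equation (6): pooled codon frequency over the T known CDSs."""
    total = Counter()
    for cds in known_cdss:
        total.update(codon_counts(cds))
    n = sum(total[c] for c in CODONS)
    return {c: (total[c] / n if n else 0.0) for c in CODONS}


def cd_value(cds_a, y, m=10):
    """Cd_A of equation (7) for one candidate coding region."""
    ranked = sorted((c for c in CODONS if c not in STOPS),
                    key=lambda c: y[c], reverse=True)
    high, low = ranked[:m], ranked[-m:]
    counts = codon_counts(cds_a)
    n_high = sum(counts[c] for c in high)
    n_low = sum(counts[c] for c in low)
    if n_high + n_low == 0:
        return 1.0  # convention stated in the text for a zero denominator
    return 2.0 * n_high / (n_high + n_low)


def is_true_coding_region(cd, cv=1.0):
    """Step (u): true when Cd_A >= CV (cv=1.0 is a placeholder)."""
    return cd >= cv
```

Under these assumptions Cd_(A) lies between 0 and 2, so a threshold CV in the stated range of 0.8 to 1.8 separates regions rich in frequently used codons from those rich in rarely used ones.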

An implementation is possible in which a computer is caused to perform this processing.

If a CDS-A of the prokaryote overlaps with a CDS-B on the complementary strand, and moreover the CDS-B is included in the CDS-A, then, when determining another CDS of the same prokaryote by utilizing the above described method, it will also be acceptable to decide upon “the truth or the falsity” of the CDS by utilizing a method which includes the processes (q) and (r) described below:

(q) The length L_(A) of the CDS-A (in base pairs) and the length L_(B) of the CDS-B (in base pairs) are compared, and, when L_(B) is less than or equal to T_(P) % of L_(A), it is decided that the CDS-B is a “false CDS”; and

(r) When L_(B) is greater than T_(P) % of L_(A), the truth or falsity of the CDS-A and the CDS-B are decided by the above described method utilizing “as a calculation expression which is capable of comparing the number of times m types of codons whose frequency of appearance is high appear in the CDS-A and the number of times m types of codons whose frequency of appearance is low appear in the CDS-A, the reciprocal of the result obtained by adding 1 to the ratio of the latter number of times to the former number of times”, or “the above described Cd value” [here, T_(P) is a positive integer from 30 through 95].

When deciding upon the truth or the falsity of a transcription unit by utilizing said method, it is possible to apply said method after having linked the CDSs which make up the transcription unit, excluding their stop codons, and having made them into a single CDS.

It is possible to execute the “method of determining a genetic structure aimed at transcription unit structure”, the “method of determining a genetic structure using a shadow discrimination function”, or the “method of determining a genetic structure aimed at the GC content of the bases in the codons” which have been disclosed for the present invention at higher speed by utilizing a computer. For this, it is necessary to create a program which commands the computer to perform each of the processes of the method of the present invention. This program can be made using a programming language such as C, C++, Perl, Fortran, BASIC, JAVA, or the like. Such a program can be executed upon an operating system such as UNIX, LINUX, Windows, MacOS, or the like.

Any computer may be utilized for executing the above described program, provided that it is endowed with the function of operating as a computer, although it is desirable for this computer to be one whose speed of calculation is high. As concrete examples, it is possible to offer the personal computer PCG-XR9F/K made by Sony Corporation, the personal computer Let's Note CF-A77J81 of Matsushita Electronics Manufacturing Co. Ltd., and the SUN Ultra 80 workstation made by Sun Microsystems, or the like. There is no requirement to perform each method or each process which has been disclosed for the present invention by utilizing the same computer. In other words, it would also be acceptable to output the results which have been obtained by some process or method which has been described for the present invention to another computer, and to perform the processing for the next such process or for another method upon said computer.

By taking advantage of a recording medium which can be read in by a computer, and upon which is recorded a program for causing a computer to execute the processes of the “method of determining a genetic structure aimed at transcription unit structure”, the “method of determining a genetic structure using a shadow discrimination function”, or the “method of determining a genetic structure aimed at the GC content of the bases in the codons” disclosed by the present invention, it is possible to increase the level of automatic operation of these methods. Said recording medium is the recording medium of the present invention.

By “recording medium which can be read in by a computer” there is meant any recording medium which can be directly read in and accessed by a computer. As this type of recording medium, it is possible to offer a magnetic storage medium such as a floppy disk, a hard disk, a magnetic tape or the like, an optical storage medium such as a CD-ROM, a CD-R, a CD-RW, a DVD-ROM, a DVD-RAM, a DVD-RW or the like, an electric storage medium such as RAM or ROM or the like, or a hybrid of these categories (for example, a magneto-optical storage medium such as MO or the like), but it is not limited to being one of these.

A genetic structure determination system based upon a computer which takes advantage of the above described recording medium which can be read in by a computer according to the present invention is a system of the present invention.

By a “genetic structure determination system based upon a computer” is meant a system which is made up of a hardware means, a software means, and a data storage means, and which is used for analyzing the information which has been recorded upon the recording medium of the present invention which can be read in by a computer.

In the following, examples of the present invention are disclosed.

Programs for commanding a computer to execute the processes which are shown in the examples described below were made using the programming language Perl version 5.005, and were executed upon the operating system Windows 2000. A “VAIO PCG-KR9F/K” manufactured by Sony Corporation was used as the computer.

EXAMPLE 1 The Determination of the Genetic Structure from the Sequence of a DNA Fragment Including a Threonine Operon of Escherichia coli K-12 Strain

According to the process shown in the flow charts of FIG. 7 and FIG. 8, the CDSs (sometimes comprising a true ORF) were determined from the sequences of the plus strand and the minus strand of a DNA fragment including the threonine operon of Escherichia coli K-12 strain. The details thereof are described below.

(1) Inputting of a Nucleotide Sequence to the Computer

The sequence of a DNA fragment including a threonine operon of Escherichia coli K-12 strain (the accession number was AE000111 and the length of the sequence was 10,596 base pairs) was obtained via the internet from GenBank of the National Center for Biotechnology Information (hereinafter abbreviated as NCBI), which is the U.S. bio-information management organization, and was stored on the hard disk.

(2) Determination of the Provisional Start Codon Which Yields the Longest ORF

The stop codons TAA, TAG, and TGA which were present upon the plus strand of the DNA fragment represented by AE000111 were searched for. The provisional start codon (any one of ATG, GTG, and TTG) which yields the longest ORF for each found stop codon was determined, and the ORFs which encode polypeptides of 200 amino acids or more in length were selected. Four ORFs were detected by said selecting process. Based upon the position of the stop codon, the ORFs were numbered ORF-D1 to ORF-D4 in order from the 5′ terminal.

In the same manner, stop codons were searched for upon the minus strand and the provisional start codons were determined. The ORFs which encode polypeptides of 200 amino acids or more in length were selected, which yielded 3 ORFs, and, based upon the position of the stop codons, the ORFs were numbered ORF-C1 to ORF-C3 in order from the 3′ terminal of the minus strand.
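The search of process (2) may be sketched as follows. This is a minimal illustration rather than the Perl program actually used: the function names are hypothetical, positions are returned as 0-based half-open spans, the amino acid count is taken to include the start codon and exclude the stop codon, and the minus strand is assumed to be handled by applying the same function to the reverse complement.

```python
STOPS = {"TAA", "TAG", "TGA"}
STARTS = {"ATG", "GTG", "TTG"}


def longest_orfs(seq, min_aa=200):
    """For each in-frame stop codon, return the longest ORF ending at it.

    Returns (start, end) spans, 0-based and half-open, where `end` is the
    position just after the stop codon; spans are sorted by stop position.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        first_start = None  # most upstream start since the previous stop
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon in STOPS:
                if first_start is not None and (i - first_start) // 3 >= min_aa:
                    orfs.append((first_start, i + 3))
                first_start = None
            elif codon in STARTS and first_start is None:
                first_start = i
    return sorted(orfs, key=lambda span: span[1])


def reverse_complement(seq):
    """Sequence of the opposite strand, read 5' to 3'."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]
```

For the minus strand, `longest_orfs(reverse_complement(seq))` gives spans in minus-strand coordinates, which would then be converted back to plus-strand positions for reporting.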

(3) The Method of Calculating the Ribosome Binding Score and the Method of Searching for the Start Codon

Since no ORFs existed upstream of ORF-D1, the start codon was determined based upon a ribosome binding score which was calculated by the following method.

From data analysis of the genome sequence of Escherichia coli K-12 strain in GenBank of the NCBI, it was confirmed that the 3′ terminal sequence of the 16S rRNA of Escherichia coli K-12 strain was 3′-AUUCCUCCA-5′. Next, according to the process shown in FIG. 5, the paired states between the sequence 3′-UUCCUCCA-5′ in the above sequence and the sequence of 8 bases upstream of the start codon were investigated, and the following values were assigned thereto:

A pairing of G and C: +4.

A pairing of A and U: +2.

A pairing of G and U: +1.

No pairing is seen at a base adjacent to a base where a pairing is present: −1.

If the start codon is GTG: −2.

If the start codon is TTG: −4.

and the sum of these numerical values was termed the ribosome binding score (S). However, no value was assigned to the pairing in which the final base A of the 3′-UUCCUCCA-5′ participated.

As the 8 bases upstream of the start codon for which the paired state was investigated, the 8 sequences from the region of the 17th base to the 10th base upstream through to the region of the 10th base to the 3rd base upstream were utilized. The ribosome binding scores were obtained for these 8 sequences (hereinafter termed U-1 to U-8 in order of position from the upstream side), and the one which yields the maximum value was selected.

If two or more sequences yielded the same maximum value, U-4, U-5, and U-6 had first priority, and U-7 and U-8 had the next priority.
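The scoring rules above may be sketched as follows. Several points are interpretations and are assumptions of this sketch: the 8-base window (read 5′ to 3′) is aligned position by position with 3′-UUCCUCCA-5′; the last position, which faces the final base A, is ignored entirely, both for pairing values and for the adjacency penalty; each unpaired position next to at least one paired position costs 1 once; and, within a priority group, ties are resolved in the listed order, with U-1 to U-3 last. All function and constant names are hypothetical.

```python
PAIR_SCORE = {frozenset("GC"): 4, frozenset("AU"): 2, frozenset("GU"): 1}
ANTI_SD = "UUCCUCCA"                    # 16S rRNA 3' end, written 3'->5'
START_PENALTY = {"ATG": 0, "GTG": -2, "TTG": -4}
TIE_ORDER = [4, 5, 6, 7, 8, 1, 2, 3]    # assumed priority among U-1..U-8


def window_score(window):
    """Pairing score of one 8-base upstream window (DNA letters, 5'->3')."""
    rna = window.upper().replace("T", "U")
    # Only the first 7 positions are scored; the final A gets no value.
    pts = [PAIR_SCORE.get(frozenset((m, r)), 0)
           for m, r in zip(rna[:7], ANTI_SD[:7])]
    score = sum(pts)
    for k, p in enumerate(pts):
        if p == 0:
            neighbours = pts[max(k - 1, 0):k] + pts[k + 1:k + 2]
            if any(neighbours):
                score -= 1  # unpaired base next to a paired base
    return score


def ribosome_binding_score(seq, start):
    """Return (S, k) for the start codon at 0-based index `start`.

    k is the index of the chosen window U-k; None if no window fits.
    """
    seq = seq.upper()
    best = None
    for k in TIE_ORDER:
        lo = start - 17 + (k - 1)       # U-1 spans bases -17..-10
        if lo < 0:
            continue
        s = window_score(seq[lo:lo + 8])
        if best is None or s > best[0]:
            best = (s, k)
    if best is None:
        return None
    return best[0] + START_PENALTY.get(seq[start:start + 3], 0), best[1]
```

Under these assumptions a window beginning with AAGGAGG, which is fully complementary to the scored part of the 16S rRNA sequence, attains the maximum pairing score of 22.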

Using this method of calculating the ribosome binding score, the ribosome binding scores (S) were calculated for up to 10 candidates for the start codon (ATG, GTG, or TTG) in order from the 5′ terminal of each ORF.

The ribosome binding scores (S) for the five candidates for the start codon, the first to the fifth in order from the 5′ terminal of each ORF, were compared with a threshold value V₁ (V₁=9.0), and if the ribosome binding score for a candidate exceeded the threshold value (S>V₁), then this candidate for the start codon was selected as a candidate for the true start codon. If no ribosome binding score exceeded V₁ but there was a candidate for the start codon whose score exceeded a threshold value V₂ (V₂=7.0) (S>V₂), then this candidate for the start codon was selected as a candidate for the true start codon.

If there was no candidate for the start codon for which the score exceeded either of the threshold values, the ribosome binding scores for five further candidates for the start codon, the sixth to the tenth from the 5′ terminal, were obtained, the ribosome binding score for each start codon was checked from the first start codon, and, if the score was greater than or equal to a threshold value V₃ (V₃=6.0), then the candidate for the start codon which yields this score was selected as the start codon.

If the ribosome binding scores of all the candidates for the start codon were less than the threshold value V₃ (S<V₃), so that there was no candidate for the start codon to be selected, then that ORF was decided as a “false CDS”.
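The three-stage threshold test may be sketched as follows. The text does not state which candidate is taken when several exceed the same threshold; this sketch assumes the one nearest the 5′ terminal. The function name and its signature are hypothetical.

```python
def select_start(scores, v1=9.0, v2=7.0, v3=6.0):
    """Pick a start codon from ribosome binding scores.

    `scores` lists S for up to 10 candidates in order from the 5' terminal.
    Returns the 0-based index of the selected candidate, or None when the
    ORF is to be decided as a "false CDS".
    """
    first_five = scores[:5]
    # Stages 1 and 2: strict thresholds V1, then V2, over candidates 1..5.
    for threshold in (v1, v2):
        for idx, s in enumerate(first_five):
            if s > threshold:
                return idx
    # Stage 3: V3 (inclusive), checked from the first candidate up to the 10th.
    for idx, s in enumerate(scores[:10]):
        if s >= v3:
            return idx
    return None
```

For example, with scores `[5, 9.5, 12]` the second candidate is chosen at the first stage, whereas with `[1, 2, 3, 4, 5, 6.5]` only the sixth candidate passes, at the third stage.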

As a result, the candidate for the true start codon of ORF-D1 was the 337th base.

(4) Search for ORFs Which Have a Possibility to Form a Polycistronic Transcription Unit

Using the ORF based upon the candidate for the true start codon of ORF-D1 which was selected in the above described process (3), it was examined whether the ORFs which were present downstream of said ORF could form a polycistron.

Since the provisional start codon of ORF-D2 was present within 60 bases downstream of the stop codon of the ORF which was selected as the candidate for a true ORF of ORF-D1 in the above described process (3), it was decided that ORF-D2 and ORF-D1 had a possibility to form a polycistronic transcription unit.

In the same manner, it was decided that ORF-D3 and ORF-D2 had a possibility to form a polycistronic transcription unit. However, ORF-D3 and ORF-D4 did not have a possibility to form a polycistronic transcription unit.

(5) Determination of the Start Codons of the ORFs Which Have a Possibility to Form a Polycistronic Transcription Unit

Since it was decided that ORF-D1 to ORF-D3 had a possibility to form a polycistronic transcription unit in the above described process (4), their true start codons were determined according to the “priority ranking rule for the start codon” shown in FIG. 4 and described below (herein termed a “rank function”, having a value which is an integer greater than or equal to 1).

The “vicinity” of the stop codon of the ORF-B is defined as the region for which the value of the rank function in FIG. 4 was from 1 to 8. If a candidate (ATG or GTG) for the start codon of the ORF-A was present in this region, the candidate was determined as the “start codon”, whatever the value of the ribosome binding score obtained by the above described method. Only in the region of the “vicinity” were the candidates for the start codon limited to ATG or GTG.

If no candidate for the start codon of the ORF-A was present in the vicinity of the stop codon of the ORF-B, it was examined whether a candidate (ATG, GTG, or TTG) for the start codon of the ORF-A was present in the region downstream of the vicinity of the stop codon of the ORF-B. Since, as a condition for the forming of a polycistron, the distance between the stop codon of the ORF-B and the start codon of the ORF-A should be within 60 bases, if a candidate for the start codon of the ORF-A was present in the region for which the value of the rank function was from 9 to 23, the ribosome binding score for each candidate for the start codon was calculated using the method described in the process (3) above in ascending order of the value of the rank function, and, if said score was greater than or equal to a threshold value V₃ (6.0), then this candidate was selected as the start codon.

If no candidate for the start codon of the ORF-A was present in the vicinity of the stop codon of the ORF-B, nor in the region downstream of this vicinity, then it was examined whether a candidate (ATG, GTG, or TTG) for the start codon of the ORF-A was present in the region upstream of the vicinity of the stop codon of the ORF-B. If a candidate for the start codon of the ORF-A was present in the region upstream of this vicinity, for which the value of the rank function was from 24 to R_(DN), then the ribosome binding score for each candidate for the start codon was calculated in ascending order of the value of the rank function, using the method described in the process (3) above, and if said score was greater than or equal to a threshold value V₃ (6.0), then this candidate was selected as the start codon. The value of R_(DN) was defined as the sum of the numerical value 23 and the integer value which was closest to and less than a value of 10% of the number of amino acids of the ORF-A; however, if this value exceeded 53, the value of R_(DN) was defined as 53.
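The upper limit R_(DN) of the rank function may be sketched as follows. The phrase “closest to and less than” is read here as rounding down (integer division), which is an assumption; for a number of amino acids that is an exact multiple of 10, a strictly-less-than reading would give a value smaller by 1. The function name is hypothetical.

```python
def r_dn(n_amino_acids: int) -> int:
    """Upper rank limit R_DN: 23 plus 10% of the ORF length, capped at 53."""
    tenth = n_amino_acids // 10  # assumed: round 10% down to an integer
    return min(23 + tenth, 53)
```

For an ORF-A of 205 amino acids this gives 43, and for any ORF-A of 300 amino acids or more it gives the cap of 53.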

As a result, the position of the candidate for the true start codon of ORF-D2 was the base 2801, and the position of the candidate for the true start codon of ORF-D3 was the base 3734.

(6) Determination of the Start Codon of an ORF Which Cannot be Determined in the Above Described Processes

The true start codon for the ORF-D4, which was decided to have no possibility to form a polycistronic transcription unit, was determined in the manner described below.

The candidate for the true start codon was determined using the method of (3) described above, and it was decided whether the ORF-A was a true ORF. The position of the candidate for the true start codon which was determined was compared with the positions of the stop codons of all the ORFs which were present upstream of the ORF-A, and it was examined whether the ORF-A overlapped with upstream ORFs. If the ORF-A overlapped with an ORF-B, then it was examined whether the length of their overlapping region was greater than or equal to 90 base pairs or greater than or equal to 10% of the length of the ORF-B.

If the length of the overlapping region was greater than or equal to 90 base pairs, or greater than or equal to 10% of the length of the ORF-B, then this ORF-A was decided as a “false ORF”. Otherwise, that is, if the length of the overlapping region was less than 90 base pairs and less than 10% of the length of the ORF-B, the ORF-A was selected as a “candidate for a true ORF”.
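The overlap test may be sketched as follows. The function names are hypothetical, and each ORF is assumed to be given as an inclusive 1-based (first base, last base) pair on the same strand.

```python
def overlap_length(a, b):
    """Number of bases shared by two inclusive (first, last) spans."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)


def is_false_by_overlap(orf_a, orf_b, max_bp=90, max_fraction=0.10):
    """True when ORF-A overlaps the upstream ORF-B too much.

    ORF-A is false if the overlap is >= 90 bp or >= 10% of ORF-B's length.
    """
    ov = overlap_length(orf_a, orf_b)
    len_b = orf_b[1] - orf_b[0] + 1
    return ov >= max_bp or ov >= max_fraction * len_b
```

For example, an ORF-A at 1000 to 2000 that shares 51 bases with an ORF-B at 100 to 1050 is kept as a candidate, since 51 is below both 90 and 10% of 951.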

As a result, the candidate for the true start codon of ORF-D4 was the 8175th base.

(7) Deleting an ORF Which is Included on the Same Strand

If the ORF-A was selected as a “candidate for a true ORF” in the above described process, then it was examined whether any of the “candidates for a true ORF” which were present upstream of the ORF-A were included between the start codon and the stop codon of the ORF-A, and the ORFs which were included were decided to be “false ORFs”.

By the above described process, it was decided that all the four ORFs from ORF-D1 to ORF-D4 were “candidates for a true ORF”.

(8) Selection of the Candidates for an ORF Upon the Minus Strand

The “candidates for a true ORF” upon the minus strand were determined by exactly the same process as described above for selecting the “candidates for a true ORF” upon the plus strand. As a result, it was decided that all of the three ORFs from ORF-C1 to ORF-C3 were “candidates for a true ORF”.

The candidate for the true start codon of ORF-C1 was the 4162nd base; the candidate for the true start codon of ORF-C2 was the 6459th base; and the candidate for the true start codon of ORF-C3 was the 7959th base. From these results, it was decided that ORF-C2 and ORF-C3 had a possibility to form a polycistronic transcription unit.

(9) Comparing the Transcription Units Upon the Plus Strand with Those on the Minus Strand

If the distance between the start codon of the true ORF (CDS) and the stop codon of a CDS upstream thereof was within 90 base pairs, it was decided that both of the CDSs were present upon the same transcription unit.
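The grouping rule may be sketched as follows. The function name and the tuple layout are hypothetical, and the coordinates are assumed to increase in the 5′ to 3′ direction of the strand concerned, so that minus-strand positions would first be converted accordingly.

```python
def group_transcription_units(cdss, max_gap=90):
    """Group CDSs on one strand into transcription units.

    `cdss` is a list of (name, start, stop) tuples. A CDS joins the unit of
    the CDS upstream of it when its start codon lies within `max_gap` bases
    of that CDS's stop codon.
    """
    units = []
    for cds in sorted(cdss, key=lambda c: c[1]):
        if units and cds[1] - units[-1][-1][2] <= max_gap:
            units[-1].append(cds)
        else:
            units.append([cds])
    return units
```

Applied to the plus-strand positions reported for ORF-D1 to ORF-D4 in this example, the sketch yields one unit containing ORF-D1, ORF-D2, and ORF-D3 and a second unit containing ORF-D4 alone.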

The structures of the transcription units upon the plus strand and those upon the minus strand were investigated in this manner, and it was found that the number of transcription units upon the plus strand was two (one of which was a polycistronic transcription unit), and that the number of transcription units upon the minus strand was two (one of which was a polycistronic transcription unit).

The position of the start codon of the first CDS of each of the transcription units and the position of the stop codon of its last CDS were obtained, and it was examined whether there was a region of overlap between a transcription unit upon the plus strand and a transcription unit upon the minus strand.

Since ORF-C1, which was determined upon the minus strand, had an overlapping region with ORF-D3, which was determined upon the plus strand, the truth or falsity of the transcription units was decided by the method described below. The length of a transcription unit was the difference between the “position of the start codon of the first CDS” and the “position of the stop codon of the last CDS”.

If a transcription unit P upon the plus strand included a transcription unit Q upon the minus strand, then the transcription unit Q upon the minus strand was decided as a “false transcription unit”. Next, if a transcription unit P upon the plus strand was included in a transcription unit Q upon the minus strand, then the transcription unit P was decided as a “false transcription unit”.

Furthermore, if the transcription unit P and the transcription unit Q overlapped, but the transcription unit P did not include the transcription unit Q, and the transcription unit Q did not include the transcription unit P, then the length of the transcription unit P was compared with that of the transcription unit Q, and the one whose length was shorter was decided as a “false transcription unit”.

Finally, if the truth or the falsity of the transcription unit P and the transcription unit Q could not be decided, then both of the transcription units were defined as true.
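The decision rules for a pair of transcription units on opposite strands may be sketched as follows. The function name is hypothetical, each unit is assumed to be given by the positions of the start codon of its first CDS and the stop codon of its last CDS (in either order), and units of exactly equal length that only partly overlap are assumed to be the undecidable case in which both are kept.

```python
def judge_units(p, q):
    """Return (p_is_true, q_is_true) for plus-strand unit p and minus-strand unit q.

    p and q are (position, position) pairs; order within a pair is ignored.
    """
    p_lo, p_hi = sorted(p)
    q_lo, q_hi = sorted(q)
    if p_hi < q_lo or q_hi < p_lo:
        return True, True            # no overlap, nothing to decide
    if p_lo <= q_lo and q_hi <= p_hi:
        return True, False           # P includes Q, so Q is false
    if q_lo <= p_lo and p_hi <= q_hi:
        return False, True           # Q includes P, so P is false
    len_p, len_q = p_hi - p_lo, q_hi - q_lo
    if len_p > len_q:
        return True, False           # the shorter unit is false
    if len_q > len_p:
        return False, True
    return True, True                # undecidable: both are kept as true
```

With the positions of this example, the plus-strand unit from base 337 to base 5020 includes the minus-strand unit from base 4162 to base 3512, so `judge_units((337, 5020), (4162, 3512))` marks the latter as false.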

Using the above method of deciding on the truth or falsity of the transcription units, the monocistronic transcription unit ORF-C1 (the region from base 4162 to base 3512) upon the minus strand was decided to be a “false transcription unit” (refer to Table 1).

(10) Outputting the Information about the Determined CDSs

The information about the CDSs which was determined by the process described above was outputted as a file upon the hard disk. The outputted information is shown in Table 1.

TABLE 1

| ORF number | Strand (plus/minus) | Position of start codon | Position of stop codon | Information about transcription unit structure | Truth or falsity of CDS |
|---|---|---|---|---|---|
| ORF-D1 | + | 337 | 2799 | 2 | true |
| ORF-D2 | + | 2801 | 3733 | 3 | true |
| ORF-D3 | + | 3734 | 5020 | 4 | true |
| ORF-D4 | + | 8175 | 9191 | 1 | true |
| ORF-C1 | − | 4162 | 3512 | | false |
| ORF-C2 | − | 6459 | 5683 | 4 | true |
| ORF-C3 | − | 7959 | 6529 | 2 | true |

The information in Table 1 was compared with the annotation information which is appended to the sequence registered in GenBank of the NCBI under accession number AE000111, and it was understood that the number of CDSs of 200 amino acids or more was 6 in both, and that all the 6 CDSs which were determined were identical between both. The positions of the start codons of 5 of the CDSs, with the exception of ORF-D4, were identical with the annotation information registered in GenBank. Furthermore it was shown that information about a transcription unit structure, which was not present in the annotation information registered in GenBank, was also obtained in the present invention.

In the table, for information about the structure of a transcription unit, the label “1” was appended to a monocistronic CDS, and the labels “2”, “4”, and “3” were respectively appended to the first CDS, to the last CDS, and to a CDS which was present internally between the first CDS and the last CDS.
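The labelling convention may be sketched as follows; the function name is hypothetical, and the CDSs of a unit are assumed to be listed in order from the first CDS to the last CDS.

```python
def unit_labels(n_cds: int) -> list:
    """Labels for the CDSs of one transcription unit, first CDS to last CDS.

    "1" marks a monocistronic CDS, "2" the first CDS, "3" an internal CDS,
    and "4" the last CDS.
    """
    if n_cds < 1:
        return []
    if n_cds == 1:
        return ["1"]
    return ["2"] + ["3"] * (n_cds - 2) + ["4"]
```

For the unit formed by ORF-D1, ORF-D2, and ORF-D3 this gives the labels 2, 3, and 4, matching the corresponding rows of Table 1.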

EXAMPLE 2

Determination of the CDSs from the Sequence of a DNA Fragment which Includes a Tryptophan Operon of Escherichia coli K-12 Strain

The CDSs were determined from the sequence of a DNA fragment which includes a tryptophan operon of Escherichia coli K-12 strain according to the method described in Example 1.

(1) Inputting of a Nucleotide Sequence to the Computer

The entire genome sequence of Escherichia coli K-12 strain was obtained via the internet from GenBank of the NCBI (the accession number was U00096 and the length of the sequence was 4,639,221 base pairs), and was stored upon the hard disk. Furthermore, the sequence from base 1,314,001 to base 1,321,021, which included a tryptophan operon, was extracted from this sequence.

(2) Determination of the CDSs and the Start Codons

From the plus strand and the minus strand of the sequence described in (1) above, the CDSs which encode polypeptides of 200 amino acids or more in length were determined according to the method described in (2) of Example 1.

Although no ORFs were determined from the plus strand, a total of 5 ORFs were determined from the minus strand. The ORFs were numbered ORF-C11 to ORF-C15 in order from the 3′ terminal of the minus strand, based upon the positions of their stop codons.

Among the candidates for ORFs upon the minus strand, ORFs which had a possibility to form a polycistronic transcription unit were searched for according to the method described in (4) of Example 1, and the truth or falsity of the 5 ORFs (ORF-C11 to ORF-C15) was decided and their start codons were determined according to the methods for calculating the ribosome binding score for each ORF and for determining the true start codon which were described in (3) to (6) of Example 1. Furthermore, CDSs which were included in another CDS upon the same strand were deleted.

As a result, 5 CDSs and 1 transcription unit were determined from the minus strand.

(3) Outputting the Results about the Determined CDSs and the Evaluation of these Results

The results about the CDSs which were finally determined were outputted as a text file on the hard disk. These results are shown in Table 2.

TABLE 2

| ORF number | Strand (plus/minus) | Position of start codon | Position of stop codon | Information about transcription unit structure | Truth or falsity of CDS |
|---|---|---|---|---|---|
| ORF-C11 | − | 1246 | 440 | 4 | true |
| ORF-C12 | − | 2439 | 1246 | 3 | true |
| ORF-C13 | − | 3812 | 2451 | 3 | true |
| ORF-C14 | − | 5408 | 3813 | 3 | true |
| ORF-C15 | − | 6970 | 5408 | 2 | true |

The information in Table 2 was compared with the annotation information which is appended to the genome sequence (accession number U00096) of the Escherichia coli K-12 strain which is registered in GenBank of the NCBI, and it was understood that the number of CDSs of 200 amino acids or more of the tryptophan operon region was 5 in both, and that all the 5 CDSs which were determined were completely identical with the CDSs which are registered in GenBank. The positions of the start codons of these 5 CDSs were identical with the annotation information registered in GenBank. Furthermore it was shown that information about the structure of a transcription unit, which was not present in the annotation information registered in GenBank, was also obtained in the present invention.

EXAMPLE 3 Determination of the CDSs from the Sequence of a DNA Fragment which Includes a Ribosomal Protein Operon of Escherichia coli K-12 Strain

The CDSs which encode polypeptides of 200 amino acids or more in length were determined from the sequence of a DNA fragment which includes a ribosomal protein operon of Escherichia coli K-12 strain according to the method described in Example 1.

(1) Inputting of a Nucleotide Sequence to the Computer

The sequence of a DNA fragment including a ribosomal protein operon of Escherichia coli K-12 strain was obtained via the internet from GenBank of the NCBI (the accession number was AE000408 and the length of the sequence was 14,659 base pairs), and was stored upon the hard disk.

(2) Determination of the CDSs and the Start Codons

The stop codons and the provisional start codons were determined fromthe sequence above described in (1) according to the method described in(2) of Example 1.

A total of 6 ORFs were determined from the plus strand, and a total of 5ORFs were determined from the minus strand. Each ORF was numbered asORF-D21 to ORF-C26 in the order from the 5′ terminal of the plus strand,and numbered as ORF-C21 to ORF-C25 in the order from the 3′ terminal ofthe minus strand.

Among the candidates for ORF upon the plus strand and upon the minusstrand, ORFs which had a possibility to form a polycistronictranscription unit were searched for according to the method describedin (4) of Example 1, and the truth or falsity of the 11 ORFs weredecided and their start codons were determined according to the methodsfor calculating the ribosome binding score for each ORF and fordetermining the start codon which were described in (3) to (6) ofExample 1. Furthermore CDSs which were included in other CDS upon thesame strand were deleted according to the method described in (7) and(8) of Example 1.

As the results, 6 CDSs and 5 transcription units were determined fromthe plus strand, and 5 CDSs and 4 transcription units were determinedfrom the minus strand.

Furthermore, overlapping between selected transcription units upon theplus strand and those upon the minus strand was investigated, and thetruth or falsity of the candidate for a true ORF was decided accordingto the method described in (9) of Example 1.

As a result, the 7 CDSs ORF-D21, ORF-D22, ORF-D24, ORF-D25, ORF-D26, ORF-C21, and ORF-C22 were decided to be true, and the 4 CDSs ORF-D23, ORF-C23, ORF-C24, and ORF-C25 were decided to be false.

(3) Outputting the Results about the Determined CDSs and the Evaluation of these Results

The results for the 7 true CDSs and the 4 false CDSs which were finally determined were output as a text file on the hard disk. These results are shown in Table 3.

TABLE 3

| ORF number | plus strand/minus strand | position of start codon | position of stop codon | information about a transcription unit structure | truth or falsity of CDS |
|---|---|---|---|---|---|
| ORF-D21 | + | 2148 | 2774 | 2 | true |
| ORF-D22 | + | 2825 | 3446 | 4 | true |
| ORF-D23 | + | 6763 | 7418 | | false |
| ORF-D24 | + | 7910 | 8961 | 1 | true |
| ORF-D25 | + | 9172 | 9809 | | true |
| ORF-D26 | + | 9743 | 10446 | | true |
| ORF-C21 | − | 1624 | 293 | 1 | true |
| ORF-C22 | − | 7410 | 6709 | 1 | true |
| ORF-C23 | − | 8891 | 8070 | | false |
| ORF-C24 | − | 9813 | 9208 | 4 | false |
| ORF-C25 | − | 10453 | 9824 | 2 | false |

The information in Table 3 was compared with the annotation information which is appended to the sequence of accession number AE000408 registered in GenBank of the NCBI, and it was understood that the number of CDSs of 200 amino acids or more in ribosomal protein operon regions which were determined by the method of the present invention was 7, and that the number of the CDSs which were registered in GenBank of the NCBI was 5. Among the determined 7 CDSs, the CDSs identical with the annotation information of GenBank were the 2 CDSs ORF-C21 and ORF-C22. The positions of the start codons of these 2 CDSs were identical with the annotation information registered in GenBank. Furthermore, it was shown that information about a transcription unit structure, which is not present in the annotation information registered in GenBank, was also obtained in the present invention.

EXAMPLE 4 Determination of the CDSs which Encode Polypeptides of 34 Amino Acids or More from the Sequence of a DNA Fragment which Includes a Ribosomal Protein Operon of Escherichia coli K-12 Strain (1)

The CDSs which encode polypeptides of 34 amino acids or more in length were determined from the sequence of a DNA fragment which includes a ribosomal protein operon of Escherichia coli K-12 strain according to the method described in Example 1.

(1) Inputting of a Nucleotide Sequence to the Computer

The sequence of a DNA fragment including a ribosomal protein operon of Escherichia coli K-12 strain was obtained via the internet from GenBank of the NCBI (the accession number was AE000408 and the length of the sequence was 14,659 base pairs), and was stored upon the hard disk.

(2) Determination of the CDSs of 34 Amino Acids or More and the Start Codons

From the sequence described in (1) above, the stop codons and the provisional start codons of the ORFs which encode polypeptides of 34 amino acids or more were determined according to the method described in (2) of Example 1.

A total of 53 ORFs were determined from the plus strand, and a total of 38 ORFs were determined from the minus strand. The ORFs were numbered, based upon the positions of their stop codons, as ORF-D101 to ORF-D153 in order from the 5′ terminal of the plus strand, and as ORF-C101 to ORF-C138 in order from the 3′ terminal of the minus strand.

Among the candidates for ORF upon the plus strand and upon the minus strand, ORFs which had a possibility to form a polycistronic transcription unit were searched for according to the method described in (4) of Example 1, and the truth or falsity of the 91 ORFs was decided and their start codons were determined by using the method of calculating the ribosome binding score for each ORF and the method of searching for the start codon which were described in (3) to (6) of Example 1. Furthermore, CDSs which were included in other CDSs upon the same strand were deleted according to the method described in (7) and (8) of Example 1. As a result, 22 CDSs and 13 transcription units were determined from the plus strand, and 23 CDSs and 2 transcription units were determined from the minus strand. Furthermore, overlapping between selected transcription units upon the plus strand and those upon the minus strand was investigated, and the truth or falsity of the candidates for a true ORF was decided according to the method described in (9) of Example 1.

As a result, the 23 CDSs on the minus strand were decided to be true, and the 22 CDSs on the plus strand were decided to be false.

(3) Outputting the Results About the Determined CDSs and the Evaluation of these Results

The results for the true ORFs (CDSs) which were finally determined were output as a text file on the hard disk. These results are shown in Table 4.

TABLE 4

| ORF number | plus strand/minus strand | position of start codon | position of stop codon | information about a transcription unit structure | truth or falsity of CDS |
|---|---|---|---|---|---|
| ORF-C101 | − | 261 | 145 | 4 | true |
| ORF-C102 | − | 1624 | 293 | 3 | true |
| ORF-C105 | − | 2066 | 1632 | 3 | true |
| ORF-C107 | − | 2249 | 2070 | 3 | true |
| ORF-C109 | − | 2756 | 2253 | 3 | true |
| ORF-C111 | − | 3124 | 2771 | 3 | true |
| ORF-C113 | − | 3667 | 3134 | 3 | true |
| ORF-C114 | − | 4072 | 3680 | 3 | true |
| ORF-C116 | − | 4411 | 4106 | 3 | true |
| ORF-C118 | − | 4965 | 4426 | 3 | true |
| ORF-C119 | − | 5294 | 4980 | 2 | true |
| ORF-C120 | − | 5676 | 5305 | 4 | true |
| ORF-C121 | − | 6095 | 5841 | 3 | true |
| ORF-C122 | − | 6286 | 6095 | 3 | true |
| ORF-C123 | − | 6696 | 6286 | 3 | true |
| ORF-C125 | − | 7410 | 6709 | 3 | true |
| ORF-C127 | − | 7760 | 7428 | 3 | true |
| ORF-C129 | − | 8053 | 7775 | 3 | true |
| ORF-C130 | − | 8891 | 8070 | 3 | true |
| ORF-C133 | − | 9211 | 8909 | 3 | true |
| ORF-C135 | − | 9813 | 9208 | 3 | true |
| ORF-C137 | − | 10453 | 9824 | 3 | true |
| ORF-C138 | − | 10797 | 10486 | 2 | true |

The information in Table 4 was compared with the annotation information which is appended to the sequence of accession number AE000408 registered in GenBank of the NCBI, and it was understood that the number of CDSs of 34 amino acids or more in ribosomal protein operon regions was 23 in both, and that all the 23 determined CDSs were identical between both. The positions of the start codons of these 23 CDSs were identical with the annotation information registered in GenBank. Furthermore, it was shown that information about a transcription unit structure, which is not present in the annotation information registered in GenBank, was also obtained in the present invention.

EXAMPLE 5 Determination of the CDSs which Encode a Polypeptide of 34 Amino Acids or More from the Sequence of a DNA Fragment which Includes a Threonine Operon of Escherichia coli K-12 Strain (1)

In order to further enhance the accuracy of determination of the CDSs which encode polypeptides of 34 amino acids or more by the method described in Example 4, a process of analysis based upon the information about the determined CDSs which encode polypeptides of 200 amino acids or more was added, and the CDSs which encode polypeptides of 34 amino acids or more were then determined at high accuracy as described below.

(1) Searching for the ORFs which Encode Polypeptides of 34 Amino Acids or More

According to the method of Example 4, the ORFs which encode polypeptides of 34 or more amino acids were searched for upon the plus strand and the minus strand of the sequence of a DNA fragment including a threonine operon of Escherichia coli K-12 strain (the accession number was AE000111 and the length of the sequence was 10,596 base pairs).

As a result, 47 ORFs were obtained from the plus strand, and 51 ORFs were obtained from the minus strand. The positions of the start codons of these ORFs were determined and the candidates for a true ORF were selected according to the method described in Example 4.

However, an ORF whose length was less than 180 base pairs and whose start codon was TTG was not selected as a CDS.

(2) Deciding Truth or Falsity of the CDSs with a Shadow Discrimination Function

For each of the true ORF candidates which were selected by the above described process (1), the truth or the falsity of the ORF was decided from the value of a shadow discrimination function as described below.

The frequencies of appearance of codons were obtained for the 6 CDSs which were determined in Example 1, which encode polypeptides of 200 amino acids or more, and which were present in the sequence of the DNA fragment including a threonine operon of Escherichia coli K-12 strain.

Based on the frequencies of appearance of codons of the 6 CDSs which were determined by the process (1) and which encode polypeptides of 200 amino acids or more, 13 combinations of codons were selected for which “the frequency of a codon appearing in the CDSs which were decided to be ‘true CDSs’ was high, and the frequency of the codon which has the sequence complementary to the 3-base sequence of said codon appearing in said CDSs was low”, and the values of the shadow discrimination function were obtained according to the method described below.

That is, when the 64 types of codons:

TTA, CTA, TCA, TTT, TTC, TTG, TCT, TCC, TCG, TAT, TAC, TGT, TGC, TGG, CTT, CTC, CTG, CCT, CCC, CCG, CAT, CAC, CGT, CGC, ATT, ATC, ACT, ACC, AAC, AGC, GTC, GCC,

TAA, TAG, TGA, AAA, GAA, CAA, AGA, GGA, CGA, ATA, GTA, ACA, GCA, CCA, AAG, GAG, CAG, AGG, GGG, CGG, ATG, GTG, ACG, GCG, AAT, GAT, AGT, GGT, GTT, GCT, GAC, GGC

were arranged in the above order, a number was appended to each codon so that the first codon was TTA and the second codon was CTA. According to the following formulas, the values of y_(i) and y_(i+32) (i is a positive integer less than or equal to 32) were calculated, the above described 64 types of codon were rearranged in descending order of these values, and a number was appended to each codon in order.

$$y_{i} = \left( \sum_{t=1}^{6} C_{i}^{t} - \sum_{t=1}^{6} C_{i+32}^{t} \right) \Bigg/ \sum_{t=1}^{6} \sum_{j=1}^{64} C_{j}^{t} \tag{2-1}$$

(herein i is a positive integer less than or equal to 32) or

$$y_{i+32} = \left( \sum_{t=1}^{6} C_{i+32}^{t} - \sum_{t=1}^{6} C_{i}^{t} \right) \Bigg/ \sum_{t=1}^{6} \sum_{j=1}^{64} C_{j}^{t} \tag{3-1}$$

(herein i is a positive integer less than or equal to 32)

wherein the number of times of the i-th codon appearing in the t-th CDS is expressed as C^(t)_(i).

Next, the top 13 codons for which the value of y_(i) or y_(i+32) is large and the bottom 13 codons for which the value of y_(i) or y_(i+32) is small were selected, and the value of the shadow discrimination function Sd_(A) for the ORF-A which was the candidate for a true ORF was obtained by the following formula:

$$Sd_{A} = 2 \times \sum_{i=1}^{13} C_{i}^{A} \Bigg/ \left( \sum_{i=1}^{13} C_{i}^{A} + \sum_{i=52}^{64} C_{i}^{A} \right) \tag{4-1}$$

herein the value of Sd_(A) is defined as 1 if

$$\sum_{i=1}^{13} C_{i}^{A} + \sum_{i=52}^{64} C_{i}^{A}$$

was zero.
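The calculation of formulas (2-1), (3-1) and (4-1) may be sketched in Python as follows. This is a minimal illustration only and is not the program of the present invention: the function names `revcomp`, `codon_counts`, `rank_codons` and `shadow_score` are hypothetical, and the sketch assumes that each CDS is supplied as an in-frame string of A, C, G and T. In the codon list given above, the (i+32)-th codon is the reverse complement of the i-th codon, so the sketch computes the difference for every codon directly against its reverse complement.

```python
from collections import Counter
from itertools import product

CODONS = ["".join(p) for p in product("ACGT", repeat=3)]
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(codon):
    """Reverse complement of a codon (the 'shadow' codon on the other strand)."""
    return codon.translate(_COMPLEMENT)[::-1]


def codon_counts(seq):
    """Count in-frame codons; a trailing partial codon is ignored (assumption)."""
    return Counter(seq[i:i + 3] for i in range(0, len(seq) - 2, 3))


def rank_codons(reference_cdss):
    """Order the 64 codons by y = (N(codon) - N(revcomp(codon))) / N(all codons)."""
    total = Counter()
    for cds in reference_cdss:
        total.update(codon_counts(cds))
    n_all = sum(total[c] for c in CODONS)  # assumed non-zero
    y = {c: (total[c] - total[revcomp(c)]) / n_all for c in CODONS}
    return sorted(CODONS, key=lambda c: y[c], reverse=True)


def shadow_score(orf, ranked, k=13):
    """Sd_A of formula (4-1): top-k codons against bottom-k codons of the ranking."""
    counts = codon_counts(orf)
    top = sum(counts[c] for c in ranked[:k])
    bottom = sum(counts[c] for c in ranked[-k:])
    if top + bottom == 0:
        return 1.0  # defined as 1 when neither group of codons appears
    return 2.0 * top / (top + bottom)
```

Under these assumptions the score ranges from 0 (only bottom-ranked codons appear) to 2 (only top-ranked codons appear), which is why thresholds near 1 separate a coding frame from its shadow on the opposite strand.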

Based on the Sd_(A) which was obtained in this manner, the ORF-A was decided to be a true CDS when the following condition was satisfied. That is to say, the ORF was decided to be a true ORF (CDS): if “Sd_(A) is greater than or equal to 0.9”, in the case of an ORF which encodes a polypeptide of 100 amino acids or more; if “Sd_(A) is greater than or equal to 1.0”, in the case of an ORF which encodes a polypeptide of 60 amino acids or more and 99 amino acids or less; and if “Sd_(A) is greater than or equal to 1.0”, in the case of an ORF which encodes a polypeptide of 34 amino acids or more and 59 amino acids or less.
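The length-dependent condition above may be expressed as a small function. This is a sketch only; the name `passes_shadow_threshold` is hypothetical, and rejecting ORFs shorter than 34 amino acids is an assumption made because such ORFs are outside the range searched in this Example.

```python
def passes_shadow_threshold(sd, aa_length):
    """Apply the Sd_A thresholds stated for each polypeptide length class."""
    if aa_length >= 100:
        return sd >= 0.9
    if aa_length >= 60:       # 60 to 99 amino acids
        return sd >= 1.0
    if aa_length >= 34:       # 34 to 59 amino acids
        return sd >= 1.0
    return False              # assumption: shorter ORFs are never candidates
```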

(3) Selection of the CDSs when Two ORFs Overlap Upon the Same Strand

When two ORFs which were selected by the above described process overlapped upon the same strand, the CDSs were selected in the following manner.

When the 5′ terminal side of an ORF-A which is the candidate for a true ORF and the 3′ terminal side of an ORF-B which is present upstream of said ORF-A overlapped, the truth or falsity of these ORFs was decided by the following method.

If the length of the overlapping region of the ORF-A and the ORF-B was less than or equal to 90 base pairs, was less than or equal to 10% of L_(A), the length of the ORF-A, and was less than or equal to L_(B), the length of the ORF-B, then both of the ORFs were selected as true CDSs.

If the length of the overlapping region did not satisfy the above described condition, the truth or falsity of the ORFs was decided by the following method.

If L_(A) was greater than L_(B), and the following formula (9) was satisfied, then the ORF-A was selected as a CDS:

L_(B) < L_(A) × (L_(A) + 12000)/20000  (9)

If L_(A) was smaller than L_(B), and moreover the following formula (10) was satisfied, then the ORF-B was selected as a CDS:

L_(A) < L_(B) × (L_(B) + 12000)/20000  (10)

When the relationship between the length of the ORF-A and the length of the ORF-B did not satisfy the formula (9) or the formula (10), then the truth or falsity of the CDS was decided according to the magnitude of the values of the shadow discrimination function of the ORF-A and the ORF-B.
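The decision sequence for two ORFs overlapping upon the same strand may be sketched as follows. The function names are hypothetical. The text states only that the final decision depends on the magnitude of the shadow discrimination values; selecting the ORF with the larger value, and preferring the ORF-A on a tie, are assumptions of this sketch.

```python
def length_rule(l_a, l_b, offset=12000, scale=20000):
    """Return 'A' or 'B' if formula (9) or (10) decides, otherwise None."""
    if l_a > l_b and l_b < l_a * (l_a + offset) / scale:   # formula (9)
        return "A"
    if l_a < l_b and l_a < l_b * (l_b + offset) / scale:   # formula (10)
        return "B"
    return None


def resolve_same_strand_overlap(l_a, l_b, overlap, sd_a, sd_b):
    """Return the set of ORFs ('A', 'B') kept as CDSs; lengths are in base pairs."""
    # A short overlap is tolerated and both ORFs are kept.
    if overlap <= 90 and overlap <= 0.10 * l_a and overlap <= l_b:
        return {"A", "B"}
    winner = length_rule(l_a, l_b)
    if winner is None:
        # Assumption: the larger shadow discrimination value wins, ties go to A.
        winner = "A" if sd_a >= sd_b else "B"
    return {winner}
```

For example, with L_(A) = 1200 and L_(B) = 600 the right side of formula (9) is 1200 × 13200/20000 = 792, so the ORF-A is selected; with L_(A) = 1000 and L_(B) = 900 neither formula holds and the shadow discrimination values decide.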

If an ORF was selected as a “candidate for a true ORF” by the above described process, then it was examined whether all the “candidates for a true ORF” which are upstream of said ORF were included between the start codon and the stop codon of said ORF, and the ORFs which were included were all decided as “false ORFs”.

As a result of the decision by the above method, 15 among the 47 ORFs which were present upon the plus strand, and 16 among the 51 ORFs which were present upon the minus strand, were selected as CDSs.

(4) Comparison of Transcription Units Upon the Plus Strand and the Minus Strand

In order to investigate the structure of the transcription units, when the distance between the start codon of a CDS which was selected as a true ORF in the process described above and the stop codon of the CDS which was present upstream thereof was within 90 base pairs, it was decided that both the CDSs were present in the same transcription unit.

As a result, it was understood that the number of transcription units upon the plus strand was 6, and the number of transcription units upon the minus strand was 9.
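The grouping of CDSs into transcription units by the 90 base pair criterion may be sketched as follows. The function name `group_transcription_units` and the coordinates in the usage lines are hypothetical. The sketch handles plus-strand coordinates only (start smaller than stop), and treating an overlap (a negative distance) as being within 90 base pairs is an assumption.

```python
def group_transcription_units(cdss, max_gap=90):
    """Group plus-strand CDSs, given as (start, stop) pairs, into transcription units.

    A CDS joins the unit of its upstream neighbour when the distance from the
    neighbour's stop codon to its own start codon is at most max_gap.
    """
    units = []
    for start, stop in sorted(cdss):
        if units and start - units[-1][-1][1] <= max_gap:
            units[-1].append((start, stop))
        else:
            units.append([(start, stop)])
    return units


# Hypothetical coordinates: the gaps are 50, 300 and 10 base pairs.
example = [(100, 400), (450, 900), (1200, 1500), (1510, 1800)]
units = group_transcription_units(example)
print(len(units))
```

Minus-strand CDSs could be handled by the same function after negating their coordinates, which is again a choice of this sketch rather than something stated in the text.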

Next, the position of the start codon of the first CDS and the position of the stop codon of the last CDS of each of the transcription units were obtained, and it was examined whether there was an overlapping region between a transcription unit of the plus strand and a transcription unit of the minus strand. If a transcription unit P of the plus strand included a transcription unit Q of the minus strand, or if a transcription unit P of the plus strand was included in a transcription unit Q of the minus strand, the following process was performed for deciding on the truth or falsity of the transcription units. The length of a transcription unit was the difference between the “position of the start codon of the first CDS” and the “position of the stop codon of the last CDS”, and the lengths of the transcription unit P and of the transcription unit Q were termed L_(P) and L_(Q) respectively.

If L_(P) was greater than L_(Q), and the following formula (11) was satisfied, then the transcription unit P was decided as a true transcription unit, and the transcription unit Q was decided as a false transcription unit.

L_(Q) < L_(P) × (L_(P) + 14000)/20000  (11)

When L_(P) was smaller than L_(Q), and the following formula (12) was satisfied, then the transcription unit P was decided as a false transcription unit, and the transcription unit Q was decided as a true transcription unit.

L_(P) < L_(Q) × (L_(Q) + 14000)/20000  (12)

When the relationship between the lengths of the transcription unit P and the transcription unit Q did not satisfy the formula (11) or the formula (12), then all the CDSs which form the transcription unit P and all the CDSs which form the transcription unit Q were respectively linked together, the values of the shadow discrimination function for these linked coding regions were calculated according to the method described in (2) of Example 5, and the transcription unit for which this value was the greater was decided as a true transcription unit.
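The comparison of a plus-strand transcription unit P with a minus-strand transcription unit Q may be sketched as follows. The function names are hypothetical; each unit is passed as the pair (position of the start codon of the first CDS, position of the stop codon of the last CDS), and the shadow discrimination values of the linked coding regions are assumed to have been computed beforehand. Returning "both" when neither unit includes the other, and preferring P on a tie of the shadow discrimination values, are assumptions of this sketch.

```python
def span(unit):
    """Return (low, high) genome coordinates of a unit given as (first_start, last_stop)."""
    low, high = sorted(unit)
    return low, high


def compare_units(unit_p, unit_q, sd_p, sd_q, offset=14000, scale=20000):
    """Return 'P' or 'Q' for the unit decided to be true, or 'both' if not nested."""
    (lo_p, hi_p), (lo_q, hi_q) = span(unit_p), span(unit_q)
    p_in_q = lo_q <= lo_p and hi_p <= hi_q
    q_in_p = lo_p <= lo_q and hi_q <= hi_p
    if not (p_in_q or q_in_p):
        return "both"  # the inclusion process does not apply
    l_p, l_q = hi_p - lo_p, hi_q - lo_q
    if l_p > l_q and l_q < l_p * (l_p + offset) / scale:   # formula (11)
        return "P"
    if l_p < l_q and l_p < l_q * (l_q + offset) / scale:   # formula (12)
        return "Q"
    # Fallback: larger shadow discrimination value of the linked coding regions.
    return "P" if sd_p >= sd_q else "Q"
```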

(5) The Comparison of the CDSs Upon the Plus Strand and the Minus Strand, and the Processing when the CDSs Upon the Plus Strand and the Minus Strand are Overlapped at the 5′ Terminal Side

For the CDSs which form transcription units and were selected in the above described manner, the truth or the falsity of two CDSs which were in an inclusion relationship was decided by the method described in (4) above.

After the decision, for the CDSs which were selected as true CDSs, it was examined whether a CDS upon the plus strand (termed CDS-A) and a CDS upon the minus strand (termed CDS-B) overlapped at the 5′ terminal side.

The truth or the falsity of the two CDSs (CDS-A and CDS-B) for which overlapping was found was decided according to the process described below.

Determination of a new start codon was carried out downstream from the codon next to the start codon of the CDS-A, utilizing the method described in (3) of Example 1. In the same manner, determination of a new start codon was carried out downstream from the codon next to the start codon of the CDS-B, utilizing the method described in (3) of Example 1. If a new start codon was determined, it was examined whether a combination of start codons, also including the previous start codons, existed for which the CDS-A and the CDS-B did not overlap. The positions of the original start codon and the newly determined start codon were compared, and, when it was possible to avoid overlapping of the CDS-A and the CDS-B, both the CDS-A and the CDS-B were decided as true CDSs. When it was not possible to avoid overlapping of the CDS-A and the CDS-B, then the truth or falsity of the CDSs was decided by the following method.

If L_(A) was greater than L_(B), wherein the lengths of the CDS-A and the CDS-B were termed L_(A) and L_(B) respectively, and if the following formula (9) was satisfied, then the CDS-B was decided as a “false CDS”.

L_(B) < L_(A) × (L_(A) + 12000)/20000  (9)

If L_(A) was smaller than L_(B), and the following formula (10) was satisfied, then the CDS-A was decided as a “false CDS”.

L_(A) < L_(B) × (L_(B) + 12000)/20000  (10)

If it was not possible to decide on the truth or the falsity of the CDS-A and the CDS-B by either of the above described formulas, then the truth or the falsity of the CDSs was decided by comparing the values of the shadow discrimination function for each of the CDSs according to the method described above.

(6) The Processing when the CDSs of the Plus Strand and the Minus Strand Overlap at the 3′ Terminal Side

If the CDS-A of the plus strand and the CDS-B of the minus strand overlapped at their respective 3′ terminal sides, the CDS-A did not include the CDS-B, and the CDS-B did not include the CDS-A, then the truth or the falsity of the CDS-A and of the CDS-B was decided by using the method described in process (5) above.

First, if the length of the overlapping region of the CDS-A and the CDS-B was less than 20% of the length of each of the CDSs, then both of the CDSs were decided to be true CDSs. If the length of the overlapping region was greater than or equal to 20% of the length of one of the CDSs, and less than 20% of the length of the other one of the CDSs, then it was decided that the former CDS was a false CDS. If it was not possible to decide on the truth or the falsity of the CDSs by these conditions, then the truth or the falsity of the CDSs was decided according to the method using the formula (9) and the formula (10) described above.
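The 20% rule for CDSs overlapping at their 3′ terminal sides may be sketched as follows. The function name `resolve_three_prime_overlap` is hypothetical, and returning None when neither the 20% conditions nor the formulas (9) and (10) give a decision (so that the caller can fall back on another criterion) is an assumption of this sketch.

```python
def resolve_three_prime_overlap(l_a, l_b, overlap, fraction=0.20):
    """Return the set of CDSs ('A', 'B') decided to be true, or None if undecided."""
    a_heavy = overlap >= fraction * l_a   # overlap is 20% or more of CDS-A
    b_heavy = overlap >= fraction * l_b   # overlap is 20% or more of CDS-B
    if not a_heavy and not b_heavy:
        return {"A", "B"}
    if a_heavy and not b_heavy:
        return {"B"}                      # CDS-A is the false CDS
    if b_heavy and not a_heavy:
        return {"A"}                      # CDS-B is the false CDS
    # Both CDSs are overlapped by 20% or more: apply formulas (9) and (10).
    if l_a > l_b and l_b < l_a * (l_a + 12000) / 20000:
        return {"A"}
    if l_a < l_b and l_a < l_b * (l_b + 12000) / 20000:
        return {"B"}
    return None
```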

By the above described processing, 7 true ORFs (CDSs) were determined from the plus strand, and 5 CDSs were determined from the minus strand. The ORFs were numbered, by the positions of their stop codons, as ORF-D201 to ORF-D207 in order from the 5′ terminal of the plus strand, and as ORF-C201 to ORF-C205 in order from the 3′ terminal of the minus strand.

(7) Outputting of the Results About the Determined CDSs and the Evaluation of these Results

The results for the 12 CDSs (ORF-D201 to ORF-D207 and ORF-C201 to ORF-C205) which were finally determined by adding the above described processing were output as a text file upon the hard disk. These results are shown in Table 5.

TABLE 5

| ORF number | plus strand/minus strand | position of start codon | position of stop codon | information about a transcription unit structure | truth or falsity of CDS |
|---|---|---|---|---|---|
| ORF-D201 | + | 337 | 2799 | 2 | true |
| ORF-D202 | + | 2801 | 3733 | 3 | true |
| ORF-D203 | + | 3734 | 5020 | 3 | true |
| ORF-D204 | + | 5088 | 5235 | 4 | true |
| ORF-D205 | + | 7986 | 8141 | 2 | true |
| ORF-D206 | + | 8175 | 9191 | 4 | true |
| ORF-D207 | + | 9306 | 9893 | 1 | true |
| ORF-C201 | − | 5657 | 5310 | 4 | true |
| ORF-C202 | − | 6459 | 5683 | 3 | true |
| ORF-C203 | − | 7959 | 6529 | 2 | true |
| ORF-C204 | − | 10452 | 9928 | 4 | true |
| ORF-C205 | − | 10571 | 10452 | 2 | true |

The information in Table 5 was compared with the annotation information which is appended to the sequence of accession number AE000111 registered in GenBank of the NCBI, and it was understood that the number of CDSs of 34 amino acids or more which were determined by the method of the present invention was 12, while the number of the CDSs which were registered in GenBank of the NCBI was 9. Among the determined 12 CDSs, the CDSs identical with the annotation information of GenBank were the 8 CDSs ORF-D201, ORF-D202, ORF-D203, ORF-D206, ORF-D207, ORF-C202, ORF-C203, and ORF-C205. The positions of the start codons of 6 of these 8 CDSs were identical with the annotation information registered in GenBank. Furthermore, it was shown that information about a transcription unit structure, which is not present in the annotation information registered in GenBank, was also obtained in the present invention.

From the results of this Example, it was shown that the accuracy of CDS determination was enhanced by combining the “method of determining a genetic structure using a shadow discrimination function” of the present invention with the “method of determining a genetic structure from the viewpoint of a transcription unit structure” of the present invention.

EXAMPLE 6 Determination of the CDSs which Encode a Polypeptide of 34 Amino Acids or More From the Sequence of a DNA Fragment which Includes a Ribosomal Protein Operon of Escherichia coli K-12 Strain (2)

According to the method described in Example 5, the CDSs which encode polypeptides of 200 amino acids or more were determined at high accuracy from the sequence of a DNA fragment including a ribosomal protein operon of Escherichia coli K-12 strain (the accession number was AE000408, and the length of the sequence was 14,659 base pairs), and the CDSs which encode polypeptides of 34 amino acids or more were determined based upon the determined CDS information by carrying out the process of the “method of determining a genetic structure using a shadow discrimination function”.

As a result, the same results as given in Example 4 were obtained.

EXAMPLE 7 Determination of the CDSs which Encode a Polypeptide of 34 Amino Acids or More from the Sequence of a DNA Fragment which Includes a Threonine Operon of Escherichia coli K-12 Strain (2)

In this Example, it was shown that the accuracy of CDS determination is further enhanced by combining the “method of deciding upon the truth or falsity of CDSs by using a coding potential” with the method described in Example 5, which combines the “method of determining a genetic structure from the viewpoint of a transcription unit structure” and the “method of determining a genetic structure using a shadow discrimination function”. As an indicator of coding potential, as shown below, a value is utilized which is obtained by a calculation formula (hereinafter referred to as the “code function” or Cd) which is based upon the numbers of appearances of codons which appear often in true CDSs and of codons which appear rarely therein.

(1) Deciding Upon the Truth or the Falsity of the ORFs According to a Method of Calculating the Value of a “Code Function”

The value of y_(i) was obtained from the formula (6) below for the T coding regions which had been determined:

$$y_{i} = \sum_{t=1}^{T} C_{i}^{t} \Bigg/ \sum_{t=1}^{T} \sum_{j=1}^{64} C_{j}^{t} \tag{6}$$

wherein the number of times of the i-th codon appearing in the t-th coding region was expressed as C^(t)_(i).

Next, the 64 types of codons were rearranged in descending order of y_(i), the “top m codons for which the value of y_(i) is large” and the “bottom m codons for which the value of y_(i) is small, excluding the translation stop codons” were selected, and the value of Cd_(A) of a specified ORF (termed ORF-A) was obtained from the formula (7) below:

$$Cd_{A} = 2 \times \sum_{i=1}^{m} C_{i}^{A} \Bigg/ \left( \sum_{i=1}^{m} C_{i}^{A} + \sum_{i=62-m}^{61} C_{i}^{A} \right) \tag{7}$$

wherein the value of Cd_(A) was defined as 1 if

$$\sum_{i=1}^{m} C_{i}^{A} + \sum_{i=62-m}^{61} C_{i}^{A}$$

was zero

[herein T is an integer greater than or equal to 2, i is a positive integer less than or equal to 64, t is a positive integer less than or equal to T, and m is an integer from 5 to 20].

In this Example, m was 15.
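The calculation of the code function may be sketched in Python as follows. This is an illustrative sketch and not the program of the present invention: the names `codon_counts`, `rank_by_frequency` and `code_score` are hypothetical, each coding region is assumed to be an in-frame string of A, C, G and T, and ranking only the 61 sense codons (so that the bottom m codons occupy positions 62−m to 61) is this sketch's reading of the exclusion of the translation stop codons.

```python
from collections import Counter
from itertools import product

STOP_CODONS = {"TAA", "TAG", "TGA"}
ALL_CODONS = ["".join(p) for p in product("ACGT", repeat=3)]
SENSE_CODONS = [c for c in ALL_CODONS if c not in STOP_CODONS]


def codon_counts(seq):
    """Count in-frame codons; a trailing partial codon is ignored (assumption)."""
    return Counter(seq[i:i + 3] for i in range(0, len(seq) - 2, 3))


def rank_by_frequency(reference_cdss):
    """Rank the 61 sense codons in descending order of y_i from formula (6)."""
    total = Counter()
    for cds in reference_cdss:
        total.update(codon_counts(cds))
    n_all = sum(total[c] for c in ALL_CODONS)  # denominator of formula (6), assumed non-zero
    y = {c: total[c] / n_all for c in SENSE_CODONS}
    return sorted(SENSE_CODONS, key=lambda c: y[c], reverse=True)


def code_score(orf, ranked, m=15):
    """Cd_A of formula (7): frequent (top m) against rare (bottom m) sense codons."""
    counts = codon_counts(orf)
    top = sum(counts[c] for c in ranked[:m])
    bottom = sum(counts[c] for c in ranked[-m:])
    if top + bottom == 0:
        return 1.0  # defined as 1 when neither group of codons appears
    return 2.0 * top / (top + bottom)
```

As with the shadow discrimination function, the value lies between 0 and 2, and a value above 1 indicates that the ORF uses the frequent codons of the reference CDSs more often than the rare ones.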

Based on the Cd_(A) which was obtained in this manner, the ORF-A was decided to be a true CDS when the following condition was satisfied.

That is, said ORF was decided as a true CDS: if “Cd_(A) was greater than or equal to 1.0”, in the case of an ORF of 100 amino acids or more; if “Cd_(A) was greater than or equal to 1.1”, in the case of an ORF of 60 amino acids or more and of 99 amino acids or less; and if “Cd_(A) was greater than or equal to 1.4”, in the case of an ORF of 34 amino acids or more and of 59 amino acids or less.

When the value of the “shadow discrimination function” was obtained, the truth or the falsity of the CDSs was decided based upon a value (Sd_(A)×Cd_(A)) which is the product of the value Sd_(A) of the “shadow discrimination function” and the value Cd_(A) of the “code function” of the ORF-A.

In other words, in the case of an ORF of 100 amino acids or more, it was decided that said ORF included a true CDS if “Sd_(A)×Cd_(A) is greater than or equal to 1.1”. In the case of an ORF of 60 amino acids or more and of 99 amino acids or less, it was decided that said ORF included a true CDS if “Sd_(A)×Cd_(A) is greater than or equal to 1.4”. In the case of an ORF of 34 amino acids or more and of 59 amino acids or less, it was decided that said ORF included a true CDS if “Sd_(A)×Cd_(A) is greater than or equal to 1.6”.
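The thresholds stated above for the code function alone and for the product Sd_(A)×Cd_(A) may be collected in one table, as in the following sketch. The names `THRESHOLDS`, `passes_code_threshold` and `passes_product_threshold` are hypothetical, and rejecting ORFs shorter than 34 amino acids is an assumption.

```python
# (minimum length in amino acids, Cd threshold, Sd x Cd threshold)
THRESHOLDS = [
    (100, 1.0, 1.1),
    (60, 1.1, 1.4),   # 60 to 99 amino acids
    (34, 1.4, 1.6),   # 34 to 59 amino acids
]


def _row_for(aa_length):
    """Return the threshold row for a polypeptide length, or None if below 34."""
    for min_len, cd_min, product_min in THRESHOLDS:
        if aa_length >= min_len:
            return cd_min, product_min
    return None


def passes_code_threshold(cd, aa_length):
    """Decision based on the code function Cd_A alone."""
    row = _row_for(aa_length)
    return row is not None and cd >= row[0]


def passes_product_threshold(sd, cd, aa_length):
    """Decision based on the product Sd_A x Cd_A."""
    row = _row_for(aa_length)
    return row is not None and sd * cd >= row[1]
```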

(2) Determination of the CDSs which Encode Polypeptides of 34 Amino Acids or More from the Sequence of a DNA Fragment which Includes a Threonine Operon

According to the method described in Example 5, except for the process which utilizes the value of the “shadow discrimination function”, determination of the CDSs which encode polypeptides of 34 amino acids or more from the plus strand and the minus strand of the sequence of a DNA fragment which included a threonine operon of Escherichia coli K-12 strain (the accession number was AE000111, and the length of the sequence was 10,596 base pairs) was carried out.

In this Example, in the process of Example 5 using the “shadow discrimination function”, the truth or falsity of the CDSs was decided by a decision method which was based upon the value of the “code function” and a decision method which was based upon the value which was the product of the “shadow discrimination function” and the “code function”, both of which were disclosed in (1) above, in addition to the decision method based upon the value of the “shadow discrimination function”.

The CDSs were selected according to the process described above. As a result, 6 CDSs (ORF-D301 to ORF-D306) were determined from the plus strand, and 3 CDSs (ORF-C301 to ORF-C303) were determined from the minus strand.

(3) Outputting the Results about the Determined CDSs and the Evaluation of these Results

The results for the 9 CDSs which were finally determined were output as a text file upon the hard disk. These results are shown in Table 6.

TABLE 6

| ORF number | plus strand/minus strand | position of start codon | position of stop codon | information about a transcription unit structure | truth or falsity of CDS |
|---|---|---|---|---|---|
| ORF-D301 | + | 337 | 2799 | 2 | true |
| ORF-D302 | + | 2801 | 3733 | 3 | true |
| ORF-D303 | + | 3734 | 5020 | 4 | true |
| ORF-D304 | + | 5234 | 5528 | 4 | true |
| ORF-D305 | + | 8238 | 9191 | 1 | true |
| ORF-D306 | + | 9306 | 9893 | 1 | true |
| ORF-C301 | − | 6459 | 5683 | 1 | true |
| ORF-C302 | − | 7959 | 6529 | 4 | true |
| ORF-C303 | − | 10494 | 9928 | 2 | true |

The information in Table 6 was compared with the annotation information which is appended to the sequence of accession number AE000111 registered in GenBank of the NCBI, and it was understood that the number of CDSs of 34 amino acids or more was 9 in both, and that these 9 CDSs were all identical between both. The positions of the start codons of these 9 CDSs were identical with the annotation information registered in GenBank. Furthermore, it was shown that information about a transcription unit structure, which is not present in the annotation information registered in GenBank, was also obtained in the present invention.

From comparison of the results of Example 5 with the results of this Example, it was apparent that the accuracy of CDS determination was further enhanced by combining a “method of deciding on the truth or the falsity of a CDS by utilizing the coding potential” with the method which is the combination of the “method of determining a genetic structure from the viewpoint of a transcription unit structure” and the “method of determining a genetic structure using a shadow discrimination function” of the present invention.

EXAMPLE 8 Determination of the CDSs which Encode a Polypeptide of 34 Amino Acids or More from the Entire Genome Sequence of Escherichia coli K-12 Strain (1)

According to the method of Examples 1 and 5, in other words, utilizing the “method of determining a genetic structure from the viewpoint of a transcription unit structure”, the CDSs which encode polypeptides of 34 or more amino acids were determined from the entire genome sequence of Escherichia coli K-12 strain MG1655.

(1) Determination of the CDSs which Encode a Polypeptide of 34 Amino Acids or More from the Entire Genome Sequence of Escherichia coli

According to the method described in Example 1, 2679 and 2661 CDSs which encode polypeptides of 34 amino acids or more were determined from the plus strand and from the minus strand, respectively, of the entire genome sequence of Escherichia coli K-12 strain (the accession number is U00096, and the length of the sequence is 4,639,221 base pairs). Furthermore, the positions of the stop codons of the total of 5340 determined CDSs were compared with the annotation information which was appended to the genome sequence of Escherichia coli K-12 strain and registered in GenBank of the NCBI, and it was understood that, among the 5340 CDSs, the stop codons of 3391 of the CDSs were identical with those of CDSs registered in GenBank. The number of CDSs which encode polypeptides of 34 amino acids or more of Escherichia coli K-12 strain which are registered in GenBank of the NCBI was 4274.

From the above results, assuming that the annotation information of Escherichia coli K-12 strain which is registered in GenBank of the NCBI is correct, and using the number (4289) of all the CDSs of Escherichia coli K-12 strain as the denominator, the accuracy of determination was 0.790 in sensitivity and 0.635 in specificity.
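The two accuracy figures can be checked with a short calculation. The formulas below are an interpretation which is consistent with the reported values and are not stated explicitly in the text: sensitivity is taken as the number of CDSs whose stop codons match divided by the number of annotated CDSs, and specificity as the same number divided by the number of determined CDSs. The function name `accuracy` is hypothetical.

```python
def accuracy(matched, annotated, predicted):
    """Return (sensitivity, specificity) under the interpretation described above."""
    sensitivity = matched / annotated   # matched CDSs / annotated CDSs
    specificity = matched / predicted   # matched CDSs / determined CDSs
    return sensitivity, specificity


# Figures of this Example: 3391 matches, 4289 annotated CDSs, 2679 + 2661 determined CDSs.
sens, spec = accuracy(matched=3391, annotated=4289, predicted=2679 + 2661)
print(f"sensitivity={sens:.4f} specificity={spec:.4f}")
```

Both values agree with the reported 0.790 and 0.635 to within 0.001.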

EXAMPLE 9 Determination of the CDSs which Encode Polypeptides of 34 Amino Acids or More from the Entire Genome Sequence of Escherichia coli K-12 Strain (2)

According to the method described in Example 6, in other words, by combining the “method of determining a genetic structure from the viewpoint of a transcription unit structure” and the “method of determining a genetic structure using a shadow discrimination function”, the CDSs which encode polypeptides of 34 amino acids or more were determined from the entire genome sequence of Escherichia coli K-12 strain MG1655.

According to the method described in Example 1, 2607 and 2648 CDSs whichencode polypeptides of 34 amino acids or more were determined from theplus strand and from the minus strand respectively of the entire genomesequence of Escherichia coli K-12 strain (the accession number isU00096, and the length of the sequence is 4,639,221 base pairs).

The positions of the stop codons of a total of 5255 of the determinedCDSs were compared with the annotation information which was appended tothe genome sequence of Escherichia coli K-12 strain and registered inGenBank of the NCBI, and it was understood that, among the 5255 CDSs,the stop codons of 4149 of the CDSs was identical with the those of CDSsregistered in GenBank.

From the above results, when annotation information of Escherichia coliK-12 strain which is registered in GenBank of NCBI was correct and thenumber (4289) of all the CDSs of Escherichia coli K-12 strain was usedas the denominator, the accuracy of determination was 0.967 in thesensitivity, and was 0.790 in the specificity.

The results of Example 8 was compared with the results of this Example,it was apparent that the accuracy of CDS determination was enhanced bycombining the “method of determining a genetic structure using a shadowdiscrimination function” of the present invention with the “method ofdetermining a genetic structure from the viewpoint of a transcriptionunit structure” of the present invention.

EXAMPLE 10 Determination of the CDSs which Encode a Polypeptide of 34 Amino Acids or More From the Entire Genome Sequence of Escherichia coli K-12 Strain (3)

According to the method of Example 7, which is the combination of the “method of determining a genetic structure from the viewpoint of a transcription unit structure”, the “method of determining a genetic structure using a shadow discrimination function”, and the “method for enhancing the determination accuracy of CDSs by using a code function” which is one of the “methods of deciding on the truth or falsity of a CDS by utilizing the coding potential”, the CDSs which encode polypeptides of 34 amino acids or more were determined from the entire genome sequence of Escherichia coli K-12 strain MG1655.

As a result, 2185 CDSs and 2281 CDSs which encode polypeptides of 34 amino acids or more were determined from the plus strand and from the minus strand, respectively, of the entire genome sequence of Escherichia coli K-12 strain (the accession number is U00096, and the length of the sequence is 4,639,221 base pairs).

The positions of the stop codons of the total of 4466 determined CDSs were compared with the annotation information which was appended to the genome sequence of Escherichia coli K-12 strain and registered in GenBank of the NCBI, and it was found that, among the 4466 CDSs, the stop codons of 4163 of the CDSs were identical with those of the CDSs registered in GenBank.

From the above results, assuming that the annotation information of Escherichia coli K-12 strain which is registered in GenBank of the NCBI is correct, and using the number (4289) of all the CDSs of Escherichia coli K-12 strain as the denominator, the accuracy of determination was 0.971 in sensitivity and 0.932 in specificity.

By comparing the results of Example 9 with the results of this Example, it is apparent that the accuracy of CDS determination is further enhanced by adding the “method of deciding upon the truth or the falsity of a CDS by utilizing the coding potential” to the combination of the “method of determining a genetic structure from the viewpoint of a transcription unit structure” and the “method of determining a genetic structure using a shadow discrimination function” of the present invention.

Among the CDSs of 34 amino acids or more of Escherichia coli K-12 strain which are registered in GenBank of the NCBI, the 112 CDSs which were not selected by this Example were:

b0005, htgA, b0024, b0302, b0395, ybbD, b0501, ybbv, b0538, ybcM, ninE, b0609, b0667, b0701, b0816, rmf, sfa, ycdF, b1030, b1146, ymgA, b1354, ydas, b1369, b1371, ydcA, b1455, b1459, b1565, dicB, ydiE, b1936, b2191, b2331, b2390, b2504, b2596, b2635, yfjx, yfjz, b2651, b2654, yqgc, b3004, ygiA, tdcR, b3122, insA_6, yrhB, tag, rfaL, yidP, rpmH, ilvM, b3808, b3837, yigW_2, b3913, ytfH, b4250, yjhE, insA_7, yjjY, yi82_1, insA_1, yacG, b0105, b0165, rnhA, yafw, ykfG, ykfF, ykfB, b0263, insA_2, insA_3, ykgJ, b0309, b0362, b0502, ybiI, ymfH, ycgw, yciG, lar, b1420, b1425, b1437, b1506, b1551, relF, b1567, malI, b811, cspc, b1824, insA_5, b2083, b2084, b2641, b2653, b2856, b2859, b2862, yqgB, b3007, yqhc, ilvY, b3776, b3975, yjfA, and pyrL.

(2) Evaluation for the Known Genes of Escherichia coli K-12 Strain

Escherichia coli is a living organism which has been well analyzed by experimental science. The CDSs described below, which were considered to have been analyzed already, were compared with the results of analysis by the method of the present invention.

From the annotation information which is appended to the genome sequence of Escherichia coli K-12 strain which is registered in GenBank of the NCBI, genes which have “rpl”, “rps”, or “rpm” as the 3 initial letters of the gene name were searched for as ribosomal proteins, and a total of 55 CDSs were selected.

Genes which have “dna” as the 3 initial letters of the gene name were searched for as genes related to DNA synthesis and replication, and a total of 11 CDSs were selected.

Genes which have “rec” as the 3 initial letters of the gene name were searched for as genes related to DNA recombination and repair, and a total of 13 CDSs were selected.

Genes which have “pyr” or “pur” as the 3 initial letters of the gene name were searched for as genes related to the synthesis of the pyrimidine and purine bases, and a total of 23 CDSs were selected.

Furthermore, the 8 CDSs thrA, thrB, thrC, trpA, trpB, trpC, trpD, and trpE were selected as genes which constitute operons of well known biosynthesis pathways of amino acids.

When the total of 110 of these known genes of Escherichia coli was compared with the 4466 ORFs encoding polypeptides of 34 amino acids or more which were determined by the method of the present invention, the only CDSs among the 110 known genes which could not be determined by the method of the present invention were rpmH and pyrL.
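The prefix-based selection of known genes described above can be sketched as follows. This is only an illustration, assuming the gene names have already been extracted from the annotation as plain strings; the function name, the group labels, and the short example list are hypothetical, and the explicitly named thr/trp genes would be added separately.

```python
from typing import Dict, Iterable, List

# Functional groups and the three-letter gene-name prefixes given in the text.
PREFIX_GROUPS = {
    "ribosomal proteins": ("rpl", "rps", "rpm"),
    "DNA synthesis and replication": ("dna",),
    "DNA recombination and repair": ("rec",),
    "pyrimidine and purine base synthesis": ("pyr", "pur"),
}


def select_known_genes(gene_names: Iterable[str]) -> Dict[str, List[str]]:
    """Group gene names by the first three letters of the name."""
    selected: Dict[str, List[str]] = {group: [] for group in PREFIX_GROUPS}
    for name in gene_names:
        for group, prefixes in PREFIX_GROUPS.items():
            # str.startswith accepts a tuple of alternative prefixes.
            if name.startswith(prefixes):
                selected[group].append(name)
                break
    return selected


# Illustrative input only; a real run would use every gene name in the annotation.
example_names = ["rplY", "rpsB", "dnaK", "recQ", "pyrF", "purA", "thrA", "lacZ"]
groups = select_known_genes(example_names)
print(groups)
```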

The positions of the start codons of the 108 known genes which could be determined by the method of the present invention were compared with the annotation information which is appended to the genome sequence of Escherichia coli K-12 strain and registered in GenBank of the NCBI.

As a result, among the 108 genes, the positions of the start codons of 101 of the CDSs, that is, all except rplY, rpsB, dnaK, dnaQ, dnaB, recQ, and pyrF, were identical with the annotation information of GenBank of the NCBI.

It was shown that the CDSs and the start codons can be determined at high accuracy from the genome sequence of a microbe by utilizing the method of the present invention.

EXAMPLE 11 Determination of the CDSs which Encode a Polypeptide of 34 Amino Acids or More From the Entire Genome Sequence of Bacillus subtilis (1)

The CDSs which encode polypeptides of 200 amino acids or more were determined at high accuracy from the entire genome sequence of Bacillus subtilis (strain 168) according to the method described in Example 5, which is the combination of the “method of determining a genetic structure from the viewpoint of a transcription unit structure” and the “method of determining a genetic structure using a shadow discrimination function”, and, by carrying out an additional process based upon the information of these determined CDSs, the CDSs which encode polypeptides of 34 amino acids or more were determined at high accuracy.

(1) Calculation of the Ribosome Binding Score for Each ORF and Method of Searching for the Start Codon

The ribosome binding score of each ORF of Bacillus subtilis was calculated and the start codons were searched for according to the method described in (3) of Example 1.

The true start codons and the CDSs were determined according to the method of determination of the CDSs from Escherichia coli K-12 strain, utilizing the information of the sequence 3′-UUCCUCCA-5′ in the sequence 3′-UUUCCUCCA-5′ of the 3′ terminal sequence of the 16S ribosomal RNA of Bacillus subtilis, which was obtained by analysis of the data registered in GenBank of the NCBI.
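For orientation, the mRNA motif that would pair perfectly with the quoted 16S rRNA segment can be derived by base-wise complementation. The sketch below is an illustration only: it considers strict Watson-Crick pairing (no G-U wobble), the function name is hypothetical, and it does not reproduce the ribosome binding score of Example 1 (3).

```python
# Watson-Crick pairing partners for RNA bases.
RNA_PAIR = {"A": "U", "U": "A", "G": "C", "C": "G"}


def complementary_mrna_motif(rrna_3_to_5: str) -> str:
    """Return the mRNA motif (written 5'->3') that pairs with an rRNA
    segment written 3'->5'.

    Because the paired strands are antiparallel, an rRNA segment written
    3'->5' lines up base by base with the mRNA written 5'->3', so no
    reversal is needed, only complementation.
    """
    return "".join(RNA_PAIR[base] for base in rrna_3_to_5.upper())


# 3'-UUCCUCCA-5' is the 16S rRNA segment quoted in the text.
motif = complementary_mrna_motif("UUCCUCCA")
print("5'-" + motif + "-3'")
```

The resulting motif contains the familiar GGAGG core of a Shine-Dalgarno sequence.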

(2) Determination of the CDSs which Encode Polypeptides of 200 Amino Acids or More

According to the method described in (2) to (9) of Example 1, 1255 CDSs and 1348 CDSs which encode polypeptides of 200 or more amino acids were determined from the plus strand and from the minus strand, respectively, of the entire genome sequence of Bacillus subtilis (strain 168) (the accession number was AL009126 and the length of the sequence was 4,214,814 base pairs).

(3) Determination of the CDSs of 34 Amino Acids or More and of the Start Codons

According to the method described in (2) of Example 5, the ORFs which encode polypeptides of 34 or more amino acids were searched for from the plus strand and from the minus strand of the entire genome sequence of Bacillus subtilis (strain 168) (the accession number was AL009126 and the length of the sequence was 4,214,814 base pairs).

As a result, 5325 ORFs were determined from the plus strand, and 5532 ORFs were determined from the minus strand. According to the method described in Example 5, the truth or falsity of the CDSs was decided and their start codons were determined based upon the information of the 2603 determined CDSs which encode polypeptides of 200 amino acids or more. As a result, 2350 CDSs were determined from the plus strand, and 2602 CDSs were determined from the minus strand.

(4) Outputting of the Results about the Determined CDSs and the Evaluation of these Results

The results about the 4952 CDSs which had been finally determined were outputted as a text file upon the hard disk.

When the positions of the stop codons of each of the outputted CDSs were compared with the annotation information which is appended to the genome sequence of Bacillus subtilis and registered in GenBank of the NCBI, among the 4952 CDSs, the stop codons of 4011 of the CDSs were identical with those of the CDSs registered in GenBank of the NCBI.
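The stop-codon comparison used for evaluation can be sketched as a set lookup. This is a minimal sketch, assuming each CDS is represented by its strand and the genome coordinate of its stop codon; that representation, the function name, and the toy coordinates are assumptions made for illustration and are not taken from the described program.

```python
from typing import Iterable, Set, Tuple

# Assumed representation of a CDS: (strand, coordinate of its stop codon).
Cds = Tuple[str, int]


def count_stop_matches(determined: Iterable[Cds], annotated: Iterable[Cds]) -> int:
    """Count determined CDSs whose stop codon (same strand, same position)
    is also the stop codon of a CDS in the annotation."""
    annotated_stops: Set[Cds] = set(annotated)
    # Each distinct determined CDS is counted once.
    return sum(1 for cds in set(determined) if cds in annotated_stops)


# Toy coordinates, for illustration only.
determined = [("+", 300), ("+", 900), ("-", 1500)]
annotated = [("+", 300), ("-", 1500), ("-", 2100)]
print(count_stop_matches(determined, annotated))
```

Matching on the stop codon rather than on the whole CDS means that a prediction with a different start codon but the same reading frame and stop still counts as a match, which is why start-codon agreement is evaluated separately.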

Assuming that the annotation information for Bacillus subtilis registered in GenBank of the NCBI is correct, and using the entire number (4100) of CDSs of Bacillus subtilis as the denominator, the accuracy of determination was 0.978 in sensitivity and 0.810 in specificity.

EXAMPLE 12 Determination of the CDSs which Encode Polypeptides of 34 Amino Acids or More From the Entire Genome Sequence of Bacillus subtilis (2)

The CDSs which encode polypeptides of 34 amino acids or more were determined from the entire genome sequence of Bacillus subtilis (strain 168) according to the method described in Example 7, which is the combination of the “method of determining a genetic structure from the viewpoint of transcription unit structure”, the “method of determining a genetic structure using a shadow discrimination function”, and the “method for enhancing the determination accuracy of CDSs by using a code function” which is one of the “methods of deciding on the truth or falsity of CDSs by using a coding potential”.

(1) Determination of the CDSs of 34 Amino Acids or More and the Start Codons From the Entire Genome Sequence of Bacillus subtilis

According to the method of Example 7, 2,149 CDSs and 2,395 CDSs which encode polypeptides of 34 or more amino acids were determined from the plus strand and from the minus strand, respectively, of the entire genome sequence of Bacillus subtilis (strain 168) (the accession number was AL009126 and the length of the sequence was 4,214,814 base pairs).

When the positions of the stop codons of the total of 4544 determined CDSs were compared with the annotation information which is appended to the genome sequence of Bacillus subtilis and registered in GenBank of the NCBI, among the 4544 CDSs, the stop codons of 4007 of the CDSs were identical with those of the CDSs registered in GenBank of the NCBI.

From the above results, assuming that the annotation information for Bacillus subtilis registered in GenBank of the NCBI is correct, and using the entire number (4100) of CDSs of Bacillus subtilis as the denominator, the accuracy of determination was 0.977 in sensitivity and 0.881 in specificity.

From the comparison of the results of Example 11 with the results of this Example, it is apparent that the accuracy of CDS determination is further enhanced by adding the “method of deciding upon the truth or the falsity of a CDS by utilizing the coding potential” to the combination of the “method of determining a genetic structure from the viewpoint of transcription unit structure” and the “method of determining a genetic structure using a shadow discrimination function” of the present invention.

Among the CDSs of 34 amino acids or more of Bacillus subtilis which are registered in GenBank of the NCBI, the 87 CDSs which were not selected by this Example were: yak, Dacca, yazB, rpmG, ybdL, yczB, comS, phrc, yczI, ydaQ, phrI, ydfH, ydiN, ydiQ, yezD, yfmN, yfmA, yflD, yflC, yhcD, yjcB, ykoP, rpmF, yoav, phrK, yqqo, ypzC, ypuA, yqzI, yrkM, yrac, yrvP, yscA, ytoA, yufc, sbo, ywhR, phrF, ywdc, yxjj, yxeE, yycC, yceK, ycgJ, ydzA, sapB, yhdD, yhds, yheF, yhay, yhaK, yhfD, yhfH, yhjQ, yjcE, ykrB, yobM, yojc, yotN, yotD, yorY, yoqK, yonT, sunA, ypcP, cotD, ypuE, yqgo, yqfv, rpsU, yrkS, yrkG, yrkB, yrdK, yrdB, sigz, yrzK, yshA, comX, yuzF, yuiA, yvzB, usd, ywzB, spsA, yyzE, and rpmH.

(2) Evaluation for the Known Genes of Bacillus subtilis

Bacillus subtilis is a living organism which has been well analyzed by experimental science. The CDSs described below, which were considered to have been analyzed already, were compared with the results of analysis by the method of the present invention.

From the annotation information which is appended to the genome sequence of Bacillus subtilis which is registered in GenBank of the NCBI, genes which have “rpl”, “rps”, or “rpm” as the 3 initial letters of the gene name were searched for as ribosomal proteins, and a total of 52 CDSs were selected.

Genes which have “dna” as the 3 initial letters of the gene name were searched for as genes which are related to DNA synthesis and replication, and a total of 11 CDSs were selected.

Genes which have “rec” as the 3 initial letters of the gene name were searched for as genes which are related to DNA recombination and repair, and a total of 5 CDSs were selected.

Genes which have “pyr” or “pur” as the 3 initial letters of the gene name were searched for as genes which are related to the synthesis of the pyrimidine and purine bases, and a total of 24 CDSs were selected.

The 10 CDSs thrS, thrB, thrC, thrZ, trpA, trpB, trpC, trpD, trpE, and trpF were selected as genes of well known biosynthesis pathways of amino acids.

The results of the determination of the 4007 ORFs which encode polypeptides of 34 amino acids or more by the method of the present invention were investigated in relation to the total of 102 of these known genes of Bacillus subtilis.

As a result, all of the above described 102 known genes could be determined. Next, the positions of the start codons of the 102 known genes which could be determined by the method of the present invention were compared with the annotation information which is appended to the genome sequence of Bacillus subtilis and registered in GenBank of the NCBI.

As a result, the positions of the start codons of 82 CDSs among the 102 CDSs were identical.

EXAMPLE 13 Determination of the CDSs which Encode Polypeptides of 34 Amino Acids or More From the Entire Genome Sequence of Pseudomonas aeruginosa Genome (1)

Among the microbes for which the entire genome sequence has been determined, the microbe of which the GC content of the entire genome sequence is 60% or more and of which genetic and biochemical analysis has progressed most is Pseudomonas aeruginosa. Thus, in this Example, the CDSs which encode polypeptides of 34 amino acids or more were determined from the entire genome sequence of Pseudomonas aeruginosa (strain PAO1).

According to the method described in Example 5, the CDSs which encode polypeptides of 200 amino acids or more were determined at high accuracy from the entire genome sequence of Pseudomonas aeruginosa (strain PAO1), and, based upon the information for the determined CDSs, the CDSs which encode polypeptides of 34 amino acids or more were determined with high accuracy by carrying out the processes of the “method of determining a genetic structure using a shadow discrimination function”.

(1) Calculation of the Ribosome Binding Score for Each ORF and Search for the Start Codons

According to the method described in Example 1 (3), the ribosome binding score for each ORF of Pseudomonas aeruginosa was calculated and the start codons were searched for.

The true start codons and the CDSs were determined according to the method of determination of the CDSs from Escherichia coli K-12 strain, utilizing the information of the sequence 3′-UUCCUCCA-5′ in the sequence 3′-AUUCCUCCA-5′ of the 3′ terminal sequence of the 16S ribosomal RNA of Pseudomonas aeruginosa, which was obtained by analysis of the data registered in GenBank of the NCBI.

(2) Determination of the CDSs of 200 Amino Acids or More

Using the method described in Example 1 (2)-(9) and the method of determination of the start codon described above, 2069 CDSs and 1644 CDSs which encode polypeptides of 200 amino acids or more were determined from the plus strand and from the minus strand, respectively, of the entire genome sequence of Pseudomonas aeruginosa (strain PAO1) with the GC content of 66.56% (the accession number is AE004091, and the length of the sequence is 6,264,403 base pairs).

(3) Determination of the CDSs of 34 Amino Acids or More and the Start Codons.

According to the method described in Example 5 (2), the ORFs which encode polypeptides of 34 amino acids or more were searched for from the plus strand and from the minus strand of the entire genome sequence of Pseudomonas aeruginosa (strain PAO1) (the accession number is AE004091, and the length of the sequence is 6,264,403 base pairs).

The threshold value which was used in deciding the truth or falsity of the CDSs based upon the value Sd_(A) of the “shadow discrimination function” which was described in Example 5 (2) was changed as described below. The ORF was decided to be a true ORF (CDS): if “Sd_(A) was greater than or equal to 1.0”, in the case of an ORF which encodes a polypeptide of 100 amino acids or more; if “Sd_(A) was greater than or equal to 1.1”, in the case of an ORF which encodes a polypeptide of 60 amino acids or more and of 99 amino acids or less; and if “Sd_(A) was greater than or equal to 1.1”, in the case of an ORF which encodes a polypeptide of 34 amino acids or more and of 59 amino acids or less.
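The length-dependent thresholds on Sd_(A) stated above can be written as a small lookup. This is only a sketch of the decision rule: the value Sd_(A) itself is assumed to have been computed beforehand by the shadow discrimination function of Example 5 (2), which is not reproduced here, and the function names are illustrative.

```python
def shadow_threshold(length_aa: int) -> float:
    """Minimum Sd_A for an ORF encoding `length_aa` amino acids,
    using the three length classes stated in the text."""
    if length_aa >= 100:
        return 1.0
    if length_aa >= 60:   # 60 to 99 amino acids
        return 1.1
    if length_aa >= 34:   # 34 to 59 amino acids
        return 1.1
    raise ValueError("ORFs shorter than 34 amino acids are not considered")


def is_true_cds_by_shadow(length_aa: int, sd_a: float) -> bool:
    """Decide the ORF to be a true ORF (CDS) if Sd_A reaches the threshold."""
    return sd_a >= shadow_threshold(length_aa)


print(is_true_cds_by_shadow(120, 1.0))   # long ORF, threshold 1.0
print(is_true_cds_by_shadow(80, 1.05))   # mid-length ORF, threshold 1.1
```

Keeping the two shorter classes as separate branches, even though both use 1.1 here, mirrors the three-class structure of the text and lets each threshold be tuned independently.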

As a result of searching for the ORFs based upon this condition, 8519 ORFs were determined from the plus strand, and 8307 ORFs were determined from the minus strand.

The truth or falsity of the CDSs was decided and the start codons were determined according to the method described in Example 5, based upon the information of the 3713 CDSs which encode polypeptides of 200 amino acids or more and which were determined as described above.

As a result, 3062 CDSs were determined from the plus strand, and 3166 CDSs were determined from the minus strand.

(4) Outputting the Results about the Determined CDSs and the Evaluation of the Results

The results about the 6228 CDSs which were finally determined were outputted as a text file upon a hard disk.

The positions of the stop codons of the CDSs which had been outputted were compared with the annotation information which was appended to the genome sequence of Pseudomonas aeruginosa and registered in GenBank of the NCBI, and, among the 6228 CDSs, the stop codons of 5018 CDSs were identical with those of the CDSs registered in GenBank.

Assuming that the annotation information of Pseudomonas aeruginosa which is registered in GenBank of the NCBI is correct, and using the total number (5565) of CDSs of Pseudomonas aeruginosa as the denominator, the accuracy of determination was 0.902 in sensitivity and 0.806 in specificity.

EXAMPLE 14 Determination of the CDSs which Encode Polypeptides of 34 Amino Acids or More From the Entire Genome Sequence of Pseudomonas aeruginosa Genome (2)

In this Example, it was shown that the accuracy of CDS determination was further enhanced by combining a “method of enhancing CDS determination accuracy from the viewpoint of the GC content of bases in the codons” with the method described in Example 13, which is the combination of the “method of determining a genetic structure from the viewpoint of a transcription unit structure” and the “method of determining a genetic structure using a shadow discrimination function”.

(1) A Method of Determining a Genetic Structure From the Viewpoint of the GC Content of Bases in the Codons

The content of the first and the third G residues and C residues of the codons within the CDS was defined as the GC content of the bases within the codons, and was calculated as the value of GC_(i) of the formula (5) below (hereinafter referred to as the value of the “GC function”):

$$GC_{i} = \frac{y^{i(1)} + y^{i(3)}}{\sum\limits_{r=1}^{3} y^{i(r)}}, \quad \text{wherein} \quad y^{i(r)} = \sum\limits_{n=1}^{N_{i}} \sum\limits_{b=1}^{4} x_{n(b)}^{i(r)} \qquad (5)$$

[herein, when the r-th base (r is 1, 2, or 3) of the n-th codon of the i-th CDS is b (b is 1, 2, 3, or 4), then $x_{n(b)}^{i(r)} = 1$ (b = 1 or 2) and $x_{n(b)}^{i(r)} = 0$ (b = 3 or 4); b is 1, 2, 3, or 4 when the r-th base of the n-th codon of the i-th CDS is G, C, A, or T, respectively; i and n are positive integers; and N_(i) denotes the total number of codons of the i-th CDS (excluding its stop codon)].
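The “GC function” of formula (5) can be sketched directly from its definition: count the G and C bases at each of the three codon positions, then divide the counts at the first and third positions by the count over all three positions. The sketch below assumes the input is the coding sequence with its stop codon already removed; the function name and the handling of a sequence without any G or C are assumptions for illustration.

```python
def gc_function(cds_without_stop: str) -> float:
    """Value of the "GC function" for one CDS.

    y[r] is the number of G or C bases at codon position r + 1, summed
    over all codons; the result is (y[0] + y[2]) / (y[0] + y[1] + y[2]).
    """
    seq = cds_without_stop.upper()
    if len(seq) % 3 != 0:
        raise ValueError("sequence length must be a multiple of 3")
    y = [0, 0, 0]
    for index, base in enumerate(seq):
        if base in ("G", "C"):
            y[index % 3] += 1  # index % 3 is the position within the codon
    total = sum(y)
    if total == 0:
        # Assumption: formula (5) is undefined without any G or C; return 0.0.
        return 0.0
    return (y[0] + y[2]) / total


# Codons ATG and GCC: one G/C at position 1, one at position 2, two at position 3.
print(gc_function("ATGGCC"))
```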

(2) The Determination of the CDSs of 34 Amino Acids or More From the Entire Genome Sequence of Pseudomonas aeruginosa Genome

The value of the above described “GC function” was utilized in the “re-searching for a start codon” and in the “decision of the truth or falsity of a CDS”.

In the “re-searching for a start codon”, the value of the “GC function” was utilized when it was decided that there was no possibility of forming a polycistronic transcription unit by the method according to the description of Example 1 (3) or (6).

Specifically, the value of the “GC function” of the region of 60 amino acids from the 5′ terminal of the selected candidate for a true ORF which encodes a polypeptide of 150 amino acids or more was obtained, and, if the value was less than or equal to ⅔, the start codon was re-searched for, starting from the codon next to the determined start codon in the 3′ direction. “Re-searching for a start codon” was repeated a maximum of twice.

With regard to “deciding the truth or falsity of a CDS”, the value of the “GC function” of the selected CDS was obtained, and, if its value was less than or equal to ⅔, it was decided that the CDS was a “false CDS”.
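The two uses of the ⅔ threshold can be sketched as simple predicates. This is a minimal sketch, assuming the G/C counts per codon position have already been obtained; it works on integer counts so that a value of exactly ⅔ is compared without floating-point rounding. All function and parameter names are hypothetical, and the actual re-search for a start codon (Example 1) is not reproduced.

```python
MAX_RESEARCH_ATTEMPTS = 2  # "repeated a maximum of twice"


def gc_value_at_most_two_thirds(gc_pos1_and_pos3: int, gc_all_positions: int) -> bool:
    """True if (y1 + y3) / (y1 + y2 + y3) <= 2/3, using integer arithmetic."""
    if gc_all_positions <= 0:
        raise ValueError("the region must contain at least one G or C")
    return 3 * gc_pos1_and_pos3 <= 2 * gc_all_positions


def needs_start_codon_research(length_aa: int, may_be_polycistronic: bool,
                               gc13_first_60: int, gc_all_first_60: int,
                               attempts_so_far: int) -> bool:
    """Re-search the start codon only for a non-polycistronic candidate of
    150 amino acids or more whose first 60 codons have a low GC-function value."""
    return (not may_be_polycistronic
            and length_aa >= 150
            and attempts_so_far < MAX_RESEARCH_ATTEMPTS
            and gc_value_at_most_two_thirds(gc13_first_60, gc_all_first_60))


def is_false_cds(gc13: int, gc_all: int) -> bool:
    """A selected CDS is decided to be false if its GC-function value is <= 2/3."""
    return gc_value_at_most_two_thirds(gc13, gc_all)


print(is_false_cds(4, 6), is_false_cds(3, 4))
```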

According to the method wherein these processes were added to the method as described in Example 13, the CDSs which encode polypeptides of 34 amino acids or more were determined from the plus strand and the minus strand of the entire genome sequence (the accession number is AE004091, and the length of the sequence is 6,264,403 base pairs) of Pseudomonas aeruginosa (strain PAO1).

As a result of selecting CDSs in this manner, 2896 CDSs and 3290 CDSs which encode polypeptides of 34 amino acids or more were determined from the plus strand and from the minus strand, respectively.

The positions of the stop codons of the total of 6186 determined CDSs were compared with the annotation information which was appended to the genome sequence of Pseudomonas aeruginosa and registered in GenBank of the NCBI.

As a result, the stop codons of 5300 of the CDSs were identical with those of the CDSs registered in GenBank.

From the above results, assuming that the annotation information for Pseudomonas aeruginosa registered in GenBank of the NCBI is correct, and using the entire number (5565) of CDSs of Pseudomonas aeruginosa as the denominator, the accuracy of determination was 0.952 in sensitivity and 0.857 in specificity.

From comparison of the results of Example 13 with the results of this Example, it was shown that the accuracy of CDS determination was further enhanced by combining a “method of determining a genetic structure from the viewpoint of the GC content of the bases in the codons” with the method which is the combination of the “method of determining a genetic structure from the viewpoint of a transcription unit structure” and the “method of determining a genetic structure using a shadow discrimination function” of the present invention.

EXAMPLE 15 The Determination of the CDSs which Encode Polypeptides of 34 Amino Acids or More From the Entire Genome Sequence of Pseudomonas aeruginosa Genome (3)

According to the method of Example 14 which is the combination of the “method of determining a genetic structure from the viewpoint of a transcription unit structure”, the “method of determining a genetic structure using a shadow discrimination function”, the “method of determining a genetic structure from the viewpoint of the GC content of the bases in the codons”, and the “method for enhancing the determination accuracy of CDSs by using a code function” which is one of the “methods of deciding upon the truth or the falsity of a CDS by utilizing the coding potential”, the CDSs which encode polypeptides of 34 amino acids or more were determined from the entire genome sequence of Pseudomonas aeruginosa (strain PAO1).

(1) Determination of the CDSs of 34 Amino Acids or More and the Start Codons From the Entire Genome Sequence of Pseudomonas aeruginosa

Except for changing the threshold value of the shadow discrimination function for selecting the CDSs, according to the method of Example 7, the CDSs which encode polypeptides of 34 amino acids or more were determined from the entire genome sequence (the accession number was AE004091, and the length of the sequence was 6,264,403 base pairs) of Pseudomonas aeruginosa (strain PAO1).

The threshold value of the shadow discrimination function for selecting the CDS was set in the following manner.

Based upon the value of the “code function” Cd_(A) which was obtained according to the above Example 7, the ORF-A was decided to be a true CDS if the following condition was satisfied. It was decided that said ORF was a true ORF (CDS): if “Cd_(A) is greater than or equal to 1.5” in the case of an ORF which encodes a polypeptide of 100 amino acids or more; if “Cd_(A) is greater than or equal to 1.5” in the case of an ORF which encodes a polypeptide of 60 amino acids or more and 99 amino acids or less; and if “Cd_(A) is greater than or equal to 1.6” in the case of an ORF which encodes a polypeptide of 34 amino acids or more and 59 amino acids or less.

Furthermore, when the value of the “shadow discrimination function” was obtained, it was possible to decide on the truth or falsity of the CDS based upon the value (Sd_(A)×Cd_(A)) which is the product of the value Sd_(A) of the “shadow discrimination function” of ORF-A and of the value Cd_(A) of the “code function”. It was decided that the ORF was a true ORF (CDS): if “Sd_(A)×Cd_(A) is greater than or equal to 1.8” in the case of an ORF which encodes a polypeptide of 100 amino acids or more; if “Sd_(A)×Cd_(A) is greater than or equal to 2.0” in the case of an ORF which encodes a polypeptide of 60 amino acids or more and 99 amino acids or less; and if “Sd_(A)×Cd_(A) is greater than or equal to 2.1” in the case of an ORF which encodes a polypeptide of 34 amino acids or more and 59 amino acids or less.
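The two sets of length-dependent thresholds can be collected in one table. The sketch below is illustrative only: Cd_(A) and Sd_(A) are assumed to have been computed beforehand by the code function and the shadow discrimination function, the names are hypothetical, and the two criteria are kept as separate predicates because the text does not state how they are combined.

```python
from typing import Tuple

# (minimum length in amino acids, minimum Cd_A, minimum Sd_A x Cd_A)
THRESHOLDS = (
    (100, 1.5, 1.8),
    (60, 1.5, 2.0),   # 60 to 99 amino acids
    (34, 1.6, 2.1),   # 34 to 59 amino acids
)


def thresholds_for(length_aa: int) -> Tuple[float, float]:
    """Return (Cd_A threshold, Sd_A x Cd_A threshold) for the ORF length."""
    for min_length, cd_min, product_min in THRESHOLDS:
        if length_aa >= min_length:
            return cd_min, product_min
    raise ValueError("ORFs shorter than 34 amino acids are not considered")


def passes_code_function(length_aa: int, cd_a: float) -> bool:
    """Criterion on the code function value Cd_A alone."""
    return cd_a >= thresholds_for(length_aa)[0]


def passes_product(length_aa: int, sd_a: float, cd_a: float) -> bool:
    """Criterion on the product Sd_A x Cd_A."""
    return sd_a * cd_a >= thresholds_for(length_aa)[1]


print(passes_code_function(120, 1.5), passes_product(80, 1.5, 1.5))
```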

As a result of selecting CDSs in this manner, 2716 CDSs and 2859 CDSs which encode polypeptides of 34 amino acids or more were determined from the plus strand and from the minus strand, respectively.

The positions of the stop codons of the total of 5575 determined CDSs were compared with the annotation information appended to the genome sequence of Pseudomonas aeruginosa registered in GenBank of the NCBI.

As a result, among the 5575 CDSs, the stop codons of 5299 of the CDSs were identical with those of the CDSs registered in GenBank.

From the above results, assuming that the annotation information for Pseudomonas aeruginosa registered in GenBank of the NCBI is correct, and using the entire number (5565) of CDSs of Pseudomonas aeruginosa as the denominator, the accuracy of determination was 0.952 in sensitivity and 0.950 in specificity.

From comparing the results of Example 14 with the results of this Example, it was shown that the accuracy of CDS determination was further enhanced by combining a “method of deciding upon the truth or the falsity of a CDS by utilizing the coding potential” with the method which is the combination of the “method of determining a genetic structure from the viewpoint of a transcription unit structure”, the “method of determining a genetic structure using a shadow discrimination function” and the “method of determining a genetic structure from the viewpoint of the GC content of the bases in the codons” of the present invention.

Among the CDSs of 34 amino acids or more of Pseudomonas aeruginosa which are recorded in GenBank of the NCBI, the 262 CDSs which were not selected in this Example were: PA0012, PA0047, PA0050, PA0104, PA0127, PA0128, PA0135, PA0160, PA0161, PA0167, PA0279, PA0318, PA0433, PA0462, PA0478, PA0483, PA0529, PA0560, PA0621, PA0632, PA0634, PA0635, PA0642, PA0646, PA0647, PA0648, PA0715, PA0716, PA0719, PA0722, PA0729, PA0756, PA0817, PA0819, PA0820, PA0884, PA0885, csrA, PA0954, PA0960, PA0980, PA0981, PA0983, PA0984, PA0986, PA0991, PA0993, PA1096, PA1112, pys2, imm2, PA1152, PA1329, PA1357, PA1369, PA1370, PA1377, galE, PA1385, PA1386, PA1414, PA1426, PA1427, PA1441, PA1468, PA1469, ccmG, PA1545, gpsA, PA1625, pcrR, pscE, pscF, PA1834, PA1882, PA1889, PA1936, PA1963, PA2036, PA2037, PA2105, PA2139, PA2146, PA2221, PA2245, PA2372, PA2456, PA2480, PA2485, cysK, PA2710, PA2816, oprI, PA2878, PA2880, hisj, rmf, PA3090, xcpP, PA3218, PA3274, PA3451, pyrC, PA3632, PA3662, PA3717, PA3764, PA3843, PA3888, PA3964, thiD, PA3981, PA3988, PA4028, PA4074, PA4095, PA4131, PA4134, PA4141, PA4146, PA4291, PA4295, fimu, PA4776, PA4789, PA4860, PA4880, vanA, PA5202, PA5395, PA5432, PA5462, PA5480, PA0014, PA0076, PA0141, PA0257, PA0258, PA0264, PA0311, PA0388, PA0442, PA0453, PA0468, nirM, PA0532, PA0656, PA0781, prpB, PA0797, PA0805, PA0814, PA0822, PA0874, PA0941, PA0977, PA0985, PA1026, PA1034, PA1044, PA1170, napE, PA1195, PA1332, PA1333, PA1359, PA1371, PA1372, rsaL, PA1508, PA1509, PA1531, PA1540, PA1653, pscQ, PA1799, PA1935, PA1939, PA2182, PA2222, PA2223, PA2224, PA2226, PA2227, PA2228, PA2459, PA2460, PA2461, PA2544, PA2570, PA2582, PA2621, PA2730, PA2731, PA2772, PA2775, PA2794, rpmF, PA2980, moaB2, PA3034, PA3051, PA3065, PA3143, PA3144, wbpL, wbpK, wbpj, wbpI, wbpH, wbpG, hisF2, hisH2, wzx, wzy, wbpE, wbpD, PA3157, wzz, PA3169, PA3270, PA3291, PA3292, PA3390, PA3520, PA3577, PA3591, PA3623, PA3696, PA3782, PA3829, PA3866, PA3998, PA4041, pchR, rpmc, rpsJ, secE, birA, PA4326, PA4349, PA4388, pilA, PA4534, rpmA, PA4607, PA4638, PA4674, PA4709, secG, tpiA, selA, PA4840, PA4872, PA5061, PA5086, PA5087, PA5088, PA5104, trxA, PA5388, and PA5528.

Industrial Applicability

According to the method of the present invention, it is possible to determine a genetic structure with enhanced accuracy.

In particular, according to the method of the present invention, it is possible to predict the structure of polycistronic transcription units and to enhance the accuracy of determination of the positions of start codons; no information needs to be given in advance; and it is possible to determine a genetic structure even for a nucleotide sequence with a high GC content.

1. A method of determining a genetic structure of a prokaryote, which comprises the steps (a) to (g) described below:
(a) setting a translation stop codon from information about the nucleotide sequence of a prokaryote (a nucleotide sequence is a sequence of DNA or RNA), and setting a provisional translation start codon which yields the longest open reading frame (hereinafter abbreviated as ORF) based upon said translation stop codon;
(b) deciding that the ORF-A and the ORF-B have a possibility to form a single transcription unit if the provisional start codon of the ORF-A is upstream of the translation stop codon of the ORF-B, or is within D_(S) bases downstream of said translation stop codon [herein D_(S) is an integer from 20 to 100], wherein, of any two neighboring ORFs which are obtained in the step (a) and present on the same strand, the downstream ORF is termed ORF-A and the upstream ORF is termed ORF-B;
(c) determining that the candidate for the translation start codon is the translation start codon of the ORF-A if the ORF-A and the ORF-B are decided to have a possibility to form a single transcription unit in the step (b) and if the candidate for the translation start codon is present within a region (hereinafter termed the “vicinity of the translation stop codon”) between D_(B) bases downstream from the first T (thymidine) residue of the translation stop codon of the ORF-B and U_(B) bases upstream from said T residue [herein D_(B) is an integer between 10 and 20, and U_(B) is an integer between 3 and 15], and determining the translation start codon of the ORF-A from a priority ranking determined by using the distance between each candidate and the translation stop codon of the ORF-B as an indicator if there is a plurality of candidates;
(d) examining whether a candidate for the translation start codon of the ORF-A is present within a region (hereinafter termed the “region around the vicinity of the translation stop codon”) between R_(D) bases downstream from the first T residue of the translation stop codon of the ORF-B and R_(U) bases upstream from said T residue and excluding said “vicinity of the translation stop codon” [herein R_(D) is an integer from 30 to 120, and R_(U) is an integer from 20 to 120] if the translation start codon of the ORF-A cannot be determined in the step (c);
(e) examining whether a ribosome binding site is present from 1 to 30 bases upstream of a candidate for the translation start codon of the ORF-A if the candidate is present in the region around the vicinity of the translation stop codon in the step (d), determining its ribosome binding sequence if such a ribosome binding site is present, and determining that the candidate which corresponds to said ribosome binding sequence is the translation start codon of the ORF-A;
(f) searching for up to the number N of candidates for the translation start codon, including the provisional start codon which yields the longest ORF, from the 5′ terminal of an ORF-A which is not decided to have a possibility to form a single transcription unit in the step (b) or whose translation start codon is not determined in the step (e), investigating whether a ribosome binding site is present from 1 to 30 bases upstream of each candidate, determining its ribosome binding sequence if such a ribosome binding site is present, and determining that the candidate which corresponds to said ribosome binding sequence is the translation start codon [herein N is an integer from 5 to 20];
(g) confirming the positions of the translation start codon and the translation stop codon, the coding region, and the transcription units from the results of determination by the step (c), the step (e) or the step (f) to determine a genetic structure.
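
Steps (a) and (b) of claim 1 can be illustrated with a minimal Python sketch. It is not the claimed implementation: the function names, the start codon set (ATG, GTG, TTG), the forward-strand-only scan, and the convention of measuring distance between the first base of the stop codon of the ORF-B and the first base of the provisional start codon of the ORF-A are all assumptions made for illustration.

```python
# Assumed codon sets; the claims name GTG and TTG as alternative start codons.
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODONS = {"ATG", "GTG", "TTG"}


def longest_orfs(seq, min_codons=34):
    """Step (a), forward strand only: for each stop codon, pair it with the
    most upstream in-frame start codon, which gives the longest ORF.

    Returns sorted (start, stop) pairs, each the 0-based index of the first
    base of the provisional start codon and of the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        first_start = None
        for pos in range(frame, len(seq) - 2, 3):
            codon = seq[pos:pos + 3]
            if codon in STOP_CODONS:
                # Length is counted in codons, excluding the stop codon.
                if first_start is not None and (pos - first_start) // 3 >= min_codons:
                    orfs.append((first_start, pos))
                first_start = None
            elif codon in START_CODONS and first_start is None:
                first_start = pos
    return sorted(orfs)


def may_form_transcription_unit(start_a, stop_b, d_s=50):
    """Step (b): ORF-A (downstream) and ORF-B (upstream) may share a
    transcription unit if the provisional start of ORF-A lies upstream of
    the stop codon of ORF-B, or at most d_s bases downstream of it.
    d_s stands for D_(S); 50 is an arbitrary value inside the 20-100 range.
    """
    return start_a < stop_b or (start_a - stop_b) <= d_s
```

The complementary strand could be handled by running the same scan on the reverse complement of the sequence; steps (c) to (g) are not covered by this sketch.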
2. The method of determining a genetic structure according to claim 1, wherein the step (e) is a step of determining the translation start codon of an ORF-A by the following steps:
determining that a mRNA sequence whose ribosome binding score exceeds a threshold value V₃, described below, is a ribosome binding sequence [herein V₃ is an integer from 4 to 12], wherein the paired state between a mRNA sequence of 4 to 17 bases upstream of a candidate for the translation start codon of the ORF-A and a sequence (3′-UUCCUCC-5′) involved in the binding to mRNA in a 16S rRNA 3′ terminal sequence, or between a mRNA sequence of 4 to 16 bases upstream of said candidate and a sequence (3′-UCCUCC-5′) involved in the binding to mRNA in a 16S rRNA 3′ terminal sequence, is expressed as a numerical value, which is termed a “score which shows the binding state between mRNA and a ribosome” (hereinafter termed a ribosome binding score), according to the four rules described below: (1) a pairing of G and C yields +4; (2) a pairing of A and U yields +2; (3) a pairing of G and U yields +1; (4) when no pairing is present at a base pair which is adjacent to a base pair where a pairing is present, then this yields −1;
determining that the candidate which corresponds to said ribosome binding sequence is the translation start codon;
dividing the “region of an ORF-B around the vicinity of the stop codon” into two regions, “the region downstream of said vicinity” and “the region upstream of said vicinity”, if there is a plurality of said translation start codons, and determining that the one of said translation start codons which has the highest priority is the true translation start codon, based on the priority of “the region downstream of said vicinity” and “the region upstream of said vicinity” in that order;
determining the translation start codon of the ORF-A from a priority ranking defined by using the distance from the translation stop codon of the ORF-B as an indicator if a plurality of translation start codons is present within the respective regions.
3. The method of determining a genetic structure according to claim 1 or claim 2, wherein the step (f) is a step of determining the translation start codon of an ORF-A by the following steps:
determining that the mRNA sequence whose ribosome binding score exceeds a threshold value V₁, described below, is a ribosome binding sequence, wherein the paired state between a mRNA sequence of from 4 to 17 bases upstream of a candidate for the translation start codon of the ORF-A and a sequence (3′-UUCCUCC-5′) involved in the binding to mRNA in a 16S rRNA 3′ terminal sequence, or between a mRNA sequence of 4 to 16 bases upstream of said candidate and a sequence (3′-UCCUCC-5′) involved in the binding to mRNA in a 16S rRNA 3′ terminal sequence, is expressed as a numerical value, termed “ribosome binding score”, according to the four rules described below: (1) a pairing of G and C yields +4; (2) a pairing of A and U yields +2; (3) a pairing of G and U yields +1; (4) when no pairing is present at a base pair which is adjacent to a base pair where a pairing is present, then this yields −1;
determining that the candidate which corresponds to said ribosome binding sequence is the translation start codon;
determining that the translation start codon corresponding to the ribosome binding sequence which yields the highest score is the true translation start codon if there is a plurality of said translation start codons;
setting one or more threshold value(s) smaller than V₁, which include the threshold value V₃, if there is no candidate which exceeds the threshold value V₁, and determining the translation start codon of the ORF-A in a stepwise manner if said threshold value is exceeded [herein V₁ is an integer which is greater than the V₃ of claim 2, and which is between 7 and 14].
4. The method of determining a genetic structure according to claim 2 or claim 3, wherein the “ribosome binding score” is calculated by deducting a numerical value P_(G) if the translation start codon is GTG, or by deducting a numerical value P_(T) if the translation start codon is TTG [herein P_(G) is an integer from 1 to 4, and P_(T) is an integer from 2 to 6].
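
The ribosome binding score of claims 2 to 4 can be sketched in Python as follows. This is one possible reading of the four rules, not a definitive implementation. The assumptions are: the mRNA window is aligned without gaps against the 16S rRNA sequence written 3′ to 5′; rule (4) is applied once to every unpaired position that has at least one paired neighbour; the best window inside the region 4 to 17 bases upstream is taken; and the defaults `p_g=2` and `p_t=4` are arbitrary values inside the claimed ranges. All function names are hypothetical.

```python
ANTI_SD = "UUCCUCC"  # 16S rRNA 3' terminal sequence, written 3'->5'

# Rules (1)-(3): pair scores for (mRNA base, rRNA base).
PAIR_SCORE = {
    ("G", "C"): 4, ("C", "G"): 4,
    ("A", "U"): 2, ("U", "A"): 2,
    ("G", "U"): 1, ("U", "G"): 1,
}


def duplex_score(mrna, anti_sd=ANTI_SD):
    """Score one ungapped alignment of an mRNA window (5'->3') against
    the rRNA sequence (3'->5')."""
    paired = [PAIR_SCORE.get((m, r), 0) for m, r in zip(mrna, anti_sd)]
    score = sum(paired)
    for k, s in enumerate(paired):
        if s == 0:
            left = k > 0 and paired[k - 1] > 0
            right = k + 1 < len(paired) and paired[k + 1] > 0
            if left or right:
                score -= 1  # rule (4), as interpreted here
    return score


def ribosome_binding_score(seq, start, anti_sd=ANTI_SD, far=17, near=4):
    """Best duplex score over all windows lying `near` to `far` bases
    upstream of the start codon whose first base is at index `start`."""
    rna = seq.upper().replace("T", "U")
    lo = max(0, start - far)
    hi = max(0, start - near + 1)  # exclusive end of the search region
    region = rna[lo:hi]
    n = len(anti_sd)
    scores = [duplex_score(region[k:k + n], anti_sd)
              for k in range(len(region) - n + 1)]
    return max(scores, default=0)


def adjusted_score(seq, start, p_g=2, p_t=4):
    """Claim 4: deduct P_(G) for a GTG start codon and P_(T) for TTG."""
    codon = seq[start:start + 3].upper().replace("U", "T")
    penalty = {"GTG": p_g, "TTG": p_t}.get(codon, 0)
    return ribosome_binding_score(seq, start) - penalty
```

For the six-base sequence 3′-UCCUCC-5′, the same functions can be called with `anti_sd="UCCUCC"` and `far=16`. A candidate would then be accepted when its score exceeds the threshold V₁ or V₃ as described in the claims.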
5. A method of determining a genetic structure, wherein a transcription unit P, a coding region A, a transcription unit Q, and a coding region B are determined by utilizing the method according to any one of claim 1 to claim 4, which further comprises the steps (h) to (j) described below if the transcription unit P or the coding region A overlaps with the transcription unit Q or the coding region B:
(h) deciding that the transcription unit Q or the coding region B is a “false transcription unit” or a “false coding region” if a transcription unit Q or a coding region B which is present upon the same strand as a transcription unit P or a coding region A is included in the transcription unit P or the coding region A;
(i) deciding that the transcription unit Q or the coding region B is a “false transcription unit” or a “false coding region” if a transcription unit Q or a coding region B which is present upon the complementary strand to a transcription unit P or a coding region A is included in the transcription unit P or the coding region A;
(j) deciding that the transcription unit or coding region whose length is shorter is a “false transcription unit” or a “false coding region” when a transcription unit P or a coding region A overlaps with a transcription unit Q or a coding region B which is present upon the complementary strand.
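
The overlap rules (h) to (j) of claim 5 can be sketched in Python as follows. The `Region` type and the function names are hypothetical. The handling of cases the claim does not address (a partial overlap on the same strand, or two partially overlapping regions of equal length) is an assumption: the sketch simply makes no decision there.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """A transcription unit or coding region; coordinates are inclusive."""
    start: int
    end: int
    strand: str  # "+" or "-"

    @property
    def length(self):
        return self.end - self.start + 1


def contains(outer, inner):
    return outer.start <= inner.start and inner.end <= outer.end


def overlaps(a, b):
    return a.start <= b.end and b.start <= a.end


def false_region(p, q):
    """Return the region judged false under steps (h)-(j), or None."""
    if not overlaps(p, q):
        return None
    # (h) and (i): a region included in another one is false,
    # whether it lies on the same or on the complementary strand.
    if contains(p, q):
        return q
    if contains(q, p):
        return p
    # (j): partial overlap on complementary strands, the shorter is false.
    if p.strand != q.strand and p.length != q.length:
        return q if q.length < p.length else p
    return None  # not covered by the claim; no decision in this sketch
```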
6. A method of determining a genetic structure, wherein the method of determining a genetic structure according to any one of claim 1 to claim 5 is utilized repeatedly.
7. A method of determining a genetic structure of a prokaryote, which comprises the steps (k) and (l) described below:
(k) selecting k types of combination of codons wherein “the frequency of appearance of one codon is high and the frequency of appearance of a codon which has the complementary sequence to the 3-base sequence of said codon is low” in a plurality (the number T) of determined coding regions of the prokaryote;
(l) comparing the “number of times of the k types of codons whose frequency of appearance is high appearing in a coding region A which is assumed to be a coding region” with the “number of times of the k types of codons whose frequency of appearance is low appearing in said coding region A”, and deciding on the truth or falsity of said coding region A [herein k is an integer greater than or equal to 5 and less than or equal to 20].
8. The method of determining a genetic structure according to claim 7, wherein the method for comparing the “number of times of the k types of codons whose frequency of appearance is high appearing in a coding region A which is assumed to be a coding region” with the “number of times of the k types of codons whose frequency of appearance is low appearing in said coding region A” is a method which involves using “the reciprocal of the sum of 1 and the ratio of the number of the latter to the number of the former” as a calculation formula and which involves deciding that said coding region A is a “false coding region” if the value of said reciprocal is less than a fixed value.
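
The verbal formula of claim 8 can be written out as a short worked equation. The symbols H and L are introduced here only for illustration: H is the number of times the k high-frequency codons appear in the coding region A, and L is the number of times the k low-frequency codons appear in it.

```latex
\frac{1}{1 + \dfrac{L}{H}} \;=\; \frac{H}{H + L}, \qquad H > 0 .
```

Under this reading the value lies between 0 and 1, equals 1/2 when both groups of codons appear equally often, and approaches 1 as the low-frequency codons become rare in the coding region A. The value Sd_(A) of formula (4) in claim 9 is twice this quantity.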
9. The method of determining a genetic structure according to claim 7, which is based on the nucleotide sequence of the number T of determined coding regions of the prokaryote and comprises the steps (m) to (p) described below:
(m) arranging the 64 types of codons so that the 3-base sequence of the i-th codon has the complementary sequence to the nucleotide sequence of the (i+32)-th codon;
(n) obtaining y_(i) from the formula (2) below and y_(i+32) from the formula (3) below:
$$y_{i} = \left( \sum_{t=1}^{T} C_{i}^{t} - \sum_{t=1}^{T} C_{i+32}^{t} \right) \Big/ \sum_{t=1}^{T} \sum_{j=1}^{64} C_{j}^{t} \qquad (2)$$
$$y_{i+32} = \left( \sum_{t=1}^{T} C_{i+32}^{t} - \sum_{t=1}^{T} C_{i}^{t} \right) \Big/ \sum_{t=1}^{T} \sum_{j=1}^{64} C_{j}^{t} \qquad (3)$$
wherein the number of appearances of the i-th codon in the t-th coding region is expressed as $C_{i}^{t}$;
(o) rearranging the 64 types of codon in the step (m) in descending order of the y_(i) and the y_(i+32), selecting the top k types of codons for which the value of y_(i) or of y_(i+32) is large, and obtaining the value of Sd_(A) for a coding region A by the following formula (4):
$$Sd_{A} = 2 \times \sum_{i=1}^{k} C_{i}^{A} \Big/ \left( \sum_{i=1}^{k} C_{i}^{A} + \sum_{i=65-k}^{64} C_{i}^{A} \right) \qquad (4)$$
[herein the value of Sd_(A) is defined as 1 if $\left( \sum_{i=1}^{k} C_{i}^{A} + \sum_{i=65-k}^{64} C_{i}^{A} \right)$ is zero];
(p) deciding that a coding region A is a true coding region if the value of Sd_(A) of said coding region calculated in the step (o) is greater than or equal to a threshold value S₁, and that it is a false coding region if said value of Sd_(A) is less than the threshold value S₁ [herein T is an integer greater than or equal to 2, i is a positive integer less than or equal to 32, j is a positive integer less than or equal to 64, t is a positive integer less than or equal to T, k is an integer from 5 to 20, and S₁ is a value from 0.8 to 1.8].
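
Steps (m) to (p) of claim 9 can be illustrated with the following Python sketch. It is a sketch under stated assumptions rather than the claimed program: the “complementary sequence” of a codon is taken to be its reverse complement, as in the abstract; the k low-frequency codons are taken directly as the reverse complements of the k selected high-frequency codons, which matches positions 65−k to 64 of the sorted list except where y values tie; and all function names are hypothetical.

```python
from collections import Counter
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(codon):
    return codon.translate(COMPLEMENT)[::-1]


def codon_counts(cds):
    """Count in-frame codons of a coding region (DNA, stop codon removed)."""
    cds = cds.upper()
    return Counter(cds[p:p + 3] for p in range(0, len(cds) - 2, 3))


def strand_bias(training_cds):
    """Formulas (2)/(3): y = (count of codon - count of its reverse
    complement) / total codon count, pooled over the T training regions."""
    total = Counter()
    for cds in training_cds:
        total.update(codon_counts(cds))
    n = sum(total.values())
    codons = ["".join(c) for c in product("ACGT", repeat=3)]
    return {c: (total[c] - total[reverse_complement(c)]) / n for c in codons}


def select_pairs(y, k):
    """Step (o), first half: the k codons with the largest y, and their
    reverse complements as the corresponding low-frequency codons."""
    high = sorted(y, key=y.get, reverse=True)[:k]
    low = [reverse_complement(c) for c in high]
    return high, low


def sd_score(cds, high, low):
    """Formula (4); defined as 1 when none of the selected codons occur."""
    counts = codon_counts(cds)
    h = sum(counts[c] for c in high)
    l = sum(counts[c] for c in low)
    return 1.0 if h + l == 0 else 2 * h / (h + l)
```

Step (p) then amounts to comparing `sd_score(...)` with the threshold S₁: the coding region A is treated as true when the score is greater than or equal to S₁ and as false otherwise.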
10. A method of determining a genetic structure of a prokaryote, which comprises the steps (q) and (r) described below, wherein a coding region of the prokaryote or a coding region A which is assumed to be a coding region overlaps with a coding region B which is assumed to be a coding region and present upon the complementary strand, and said coding region B is included in said coding region A:
(q) comparing the length L_(B) (in base pairs) of said coding region B with the length L_(A) (in base pairs) of said coding region A, and deciding that said coding region B is a “false coding region” if L_(B) is less than or equal to T_(P) % of L_(A);
(r) deciding on the truth or falsity of said coding region A and of said coding region B by the method according to any one of claim 7 to claim 9 if L_(B) exceeds T_(P) % of L_(A) [herein, T_(P) is a positive integer from 30 to 95].
11. A method of determining a genetic structure, characterized by removing the translation stop codons from the coding regions which form a transcription unit, and linking up the resulting coding regions into a single coding region, before utilizing the method according to any one of claim 7 to claim 10.
12. A method of determining a genetic structure, which comprises: deciding on the truth or falsity of a coding region or of a transcription unit which is determined by the method of determining a genetic structure according to any one of claim 1 to claim 6, by utilizing the method of determining a genetic structure according to any one of claim 7 to claim 11.
13. A method of determining a genetic structure, which comprises: deciding on the truth or falsity of a coding region which encodes a polypeptide of L_(M) amino acids or more in length, by using the method of determining a genetic structure according to any one of claim 7 to claim 12, based on the nucleotide sequence of a coding region which is determined by using the method of determining a genetic structure according to any one of claim 1 to claim 12 and which encodes a polypeptide of L_(F) amino acids or more in length [herein L_(F) is a positive integer greater than or equal to 100, and L_(M) is a positive integer greater than or equal to 20].
14. A method of determining a genetic structure of a prokaryote, characterized by deciding that a coding region in the nucleotide sequence of the prokaryote is a “false coding region” if the GC content of said nucleotide sequence is greater than 50% and if a content, calculated by utilizing a calculation formula which yields a content of the first and third G residues and C residues of the codons in said nucleotide sequence, is less than a fixed value.
15. The method of determining a genetic structure according to claim 14, wherein the following formula (5) is used as a calculation formula, the value of GC_(i) described below is used as a calculated content, and one value which is selected from 0.6 to 0.75 is used as a fixed value:
$$GC_{i} = \left( y^{i(1)} + y^{i(3)} \right) \Big/ \sum_{r=1}^{3} y^{i(r)}, \quad \text{wherein} \quad y^{i(r)} = \sum_{n=1}^{N_{i}} \sum_{b=1}^{4} x_{n(b)}^{i(r)} \qquad (5)$$
[herein, when the r-th base (r is 1, 2, or 3) of the n-th codon of the i-th coding region is b (b is 1, 2, 3, or 4), then $x_{n(b)}^{i(r)} = 1$ if b = 1 or 2, and $x_{n(b)}^{i(r)} = 0$ if b = 3 or 4; as for b, when the r-th base of the n-th codon of the i-th coding region is G, C, A, or T, b is 1, 2, 3, or 4, respectively; i and n are positive integers; and N_(i) denotes the total number of the codons (excluding the translation stop codon) of the i-th coding region].
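
Formula (5) and the decision of claim 14 can be sketched in Python as follows. The sketch reads formula (5) as the share of the G and C residues of a coding region that fall at the first and third codon positions. The function names, the default threshold of 0.7 (an arbitrary value inside the claimed range of 0.6 to 0.75), and the choice to make no decision for a region without any G or C are assumptions for illustration.

```python
def gc13_fraction(cds):
    """GC_i of formula (5) for one coding region given as a DNA string
    without its stop codon. Returns None if the region has no G or C."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    gc = [0, 0, 0]  # G+C counts at codon positions 1, 2 and 3
    for p in range(usable):
        if cds[p] in "GC":
            gc[p % 3] += 1
    total = sum(gc)
    return None if total == 0 else (gc[0] + gc[2]) / total


def is_false_coding_region(cds, genome_gc, threshold=0.7):
    """Claim 14: only applied when the GC content of the nucleotide
    sequence exceeds 50%; false if GC_i is below the fixed value."""
    value = gc13_fraction(cds)
    if genome_gc <= 0.5 or value is None:
        return False
    return value < threshold
```

For example, `gc13_fraction("GCGGAGCTC")` counts three G or C residues at position 1, one at position 2, and three at position 3, giving 6/7.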
16. A method of determining a genetic structure of a prokaryote, which comprises: deciding that a coding region in the nucleotide sequence of the prokaryote is a “false coding region” if the GC content of said nucleotide sequence is greater than 50%, and if a content, calculated by utilizing a calculation formula which yields a content of the first and third G residues and C residues of the codons in said nucleotide sequence, is less than a fixed value; and re-searching for a translation start codon which is present downstream of said translation start codon which is decided to be false.
17. The method of determining a genetic structure according to claim 16, wherein the following formula (5) is used as a calculation formula, the value of GC_(i) described below is used as a calculated content, and one value which is selected from 0.6 to 0.75 is used as a fixed value:
$$GC_{i} = \left( y^{i(1)} + y^{i(3)} \right) \Big/ \sum_{r=1}^{3} y^{i(r)}, \quad \text{wherein} \quad y^{i(r)} = \sum_{n=1}^{N_{i}} \sum_{b=1}^{4} x_{n(b)}^{i(r)} \qquad (5)$$
[herein, when the r-th base (r is 1, 2, or 3) of the n-th codon of the i-th coding region is b (b is 1, 2, 3, or 4), then $x_{n(b)}^{i(r)} = 1$ if b = 1 or 2, and $x_{n(b)}^{i(r)} = 0$ if b = 3 or 4; as for b, when the r-th base of the n-th codon of the i-th coding region is G, C, A, or T, b is 1, 2, 3, or 4, respectively; i and n are positive integers; and N_(i) denotes the total number of the codons (excluding the translation stop codon) of the i-th coding region].
18. A method of determining a genetic structure of a prokaryote whose GC content in the nucleotide sequence exceeds 50%, wherein the method of determining a genetic structure according to any one of claim 1 to claim 13 and the method of determining a genetic structure according to any one of claim 14 to claim 17 are utilized.
19. A method of determining a genetic structure of a prokaryote, wherein the method of determining a genetic structure according to any one of claim 1 to claim 18 and a “method of deciding on the truth or falsity of a coding region by utilizing a coding potential” are utilized.
20. The method of determining a genetic structure according to claim 19, wherein said “method of deciding on the truth or falsity of a coding region by utilizing a coding potential” is a method of deciding on the truth or falsity of the coding region A described below by, based upon the nucleotide sequences of the number T of the determined coding regions of the prokaryote, comparing the “number of times of m types of codons whose frequency of appearance is high appearing in the coding region A which is assumed to be the coding region” with the “number of times of m types of codons whose frequency of appearance is low appearing in the coding region A” for the number T of coding regions [herein, T is an integer greater than or equal to 2, and m is an integer greater than or equal to 5 and less than or equal to 20].
21. The method of determining a genetic structure according to claim 20, wherein the method of comparing the “number of times of m types of codons whose frequency of appearance is high appearing in the coding region A which is assumed to be the coding region” and the “number of times of m types of codons whose frequency of appearance is low appearing in the coding region A” is a method which involves utilizing the “reciprocal of the sum of 1 and the ratio of the number of the latter to the number of the former” as a calculation formula, and which decides that said coding region A is a “false coding region” if the value of said reciprocal is less than a fixed value [herein m is an integer greater than or equal to 5 and less than or equal to 20].
22. The method of determining a genetic structure according to claim 20, which comprises the steps (s) to (u) described below:
(s) obtaining y_(i) from the following formula (6):
$$y_{i} = \sum_{t=1}^{T} C_{i}^{t} \Big/ \sum_{t=1}^{T} \sum_{j=1}^{64} C_{j}^{t} \qquad (6)$$
wherein the number of times of the i-th codon appearing in the t-th coding region is expressed as $C_{i}^{t}$;
(t) rearranging the 64 types of codon in descending order of y_(i), selecting the “top m codons for which the value of y_(i) is large” and the “bottom m codons for which the value of y_(i) is small, excluding the translation stop codons”, and obtaining the value of Cd_(A) for the coding region A which is assumed to be the coding region from the following formula (7):
$$Cd_{A} = 2 \times \sum_{i=1}^{m} C_{i}^{A} \Big/ \left( \sum_{i=1}^{m} C_{i}^{A} + \sum_{i=62-m}^{61} C_{i}^{A} \right) \qquad (7)$$
[herein the value of Cd_(A) is defined as 1 if $\left( \sum_{i=1}^{m} C_{i}^{A} + \sum_{i=62-m}^{61} C_{i}^{A} \right)$ is zero];
(u) deciding that said coding region A is a true coding region if the value of Cd_(A) for said coding region A which is calculated in the step (t) is greater than or equal to a threshold value CV, and deciding that it is a false coding region if said value of Cd_(A) is less than the threshold value CV [herein T is an integer greater than or equal to 2; i is a positive integer less than or equal to 64; j is a positive integer less than or equal to 64; t is a positive integer less than or equal to T; m is an integer from 5 to 20; and CV is a value from 0.8 to 1.8].
23. A method of determining a genetic structure, which comprises the steps (v) and (w) described below if a coding region of the prokaryote or a coding region A which is assumed to be a coding region overlaps with a coding region B which is assumed to be a coding region and present upon the complementary strand, and if said coding region B is included in said coding region A:
(v) comparing the length L_(B) (in base pairs) of said coding region B with the length L_(A) (in base pairs) of said coding region A, and deciding that said coding region B is a “false coding region” if L_(B) is less than or equal to T_(P) % of L_(A);
(w) deciding on the truth or falsity of said coding region A and of said coding region B by the method of determining a genetic structure according to any one of claim 18 to claim 22 if L_(B) exceeds T_(P) % of L_(A) [herein T_(P) is a positive integer from 30 to 95].
24. A method of determining a genetic structure, characterized by removing the translation stop codons from the coding regions which form a transcription unit, and linking up the resulting coding regions into a single coding region, before utilizing the method of determining a genetic structure according to any one of claim 18 to claim 23.
25. A program for executing the following steps on a computer:
(a) finding a translation stop codon in the nucleotide sequence of a prokaryote from the information of said nucleotide sequence inputted via an input device, searching for a provisional translation start codon which yields the longest open reading frame (ORF) for all the obtained translation stop codons to make a candidate for ORF which is the combination of said translation stop codon and provisional translation start codon, and storing the position of these codons in said nucleotide sequence in a memory;
(b) calling up from the memory two adjacent candidates for ORF which are present upon the same strand, investigating the positions of the provisional translation start codon of the downstream side ORF (termed ORF-A) and of the translation stop codon of the upstream side ORF (termed ORF-B) and the distance between the ORF-A and the ORF-B; and deciding that the two adjacent ORFs have a possibility to form a single transcription unit if the provisional translation start codon of the ORF-A is upstream of the translation stop codon of the ORF-B, or is within D_(S) bases downstream of said translation stop codon [herein D_(S) is an integer from 20 to 100], and proceeding to the step (c); or deciding that the two adjacent ORFs do not form a single transcription unit if the distance between the positions of the provisional translation start codon of the ORF-A and of the translation stop codon of the ORF-B does not satisfy the above described condition, and proceeding to the step (f);
(c) calling up the above described nucleotide sequence data for the two ORFs which are decided to have a possibility to form a single transcription unit in the step (b), and searching for a candidate for the translation start codon of the ORF-A from a region (hereinafter termed the “vicinity of the translation stop codon”) between D_(B) bases downstream from the first T (thymidine) residue of the translation stop codon of the ORF-B and U_(B) bases upstream from said T residue [here D_(B) is an integer between 10 and 20, and U_(B) is an integer between 3 and 15]; and determining that the ORF-A whose translation start codon is said candidate is a true coding region if there is a single candidate for the translation start codon, determining that said ORF-A and ORF-B form a single transcription unit, and writing the results of this determination into the memory; or selecting the candidate whose priority is the highest if there is a plurality of candidates for the translation start codon, wherein the distance between each candidate and the translation stop codon of the ORF-B is used as an indicator of priority, determining that the ORF-A whose translation start codon is said candidate is a true coding region, determining that said ORF-A and ORF-B constitute a single transcription unit, and writing the results of the determination into the memory;
(d) calling up the above described nucleotide sequence data if the translation start codon of the ORF-A cannot be determined in the step (c), examining whether a candidate for the translation start codon of the ORF-A is present within a region (hereinafter termed the “coding region around the vicinity of the translation stop codon”) between R_(D) bases downstream from the first T residue of the translation stop codon of the ORF-B and R_(U) bases upstream from said T residue [here R_(D) is an integer from 30 to 120, and R_(U) is an integer from 20 to 120] and excluding the “vicinity of the translation stop codon”; and proceeding to the step (e) if a candidate for the translation start codon of the ORF-A is present in said region, or proceeding to the step (f) if no such candidate is present;
(e) calling up the above described nucleotide sequence data for a candidate for the translation start codon of the ORF-A found in the step (d), examining whether a ribosome binding site is present from 1 to 30 bases upstream of each candidate, determining its ribosome binding sequence if such a ribosome binding site is present, determining that the ORF-A, whose translation start codon is the candidate which corresponds to said ribosome binding sequence, is a true coding region, determining that said ORF-A and ORF-B form a single transcription unit, and writing the results of the determination into the memory;
(f) calling up the above described nucleotide sequence data for an ORF-A which is not decided to form a single transcription unit in the step (b) or for an ORF-A whose translation start codon cannot be determined in the step (e), searching for up to the number N of candidates [here N is an integer from 5 to 20] for the translation start codon, including the provisional start codon which yields the longest ORF, from the 5′ terminal, examining whether a ribosome binding site is present from 1 to 30 bases upstream of each candidate, determining its ribosome binding sequence if such a ribosome binding site is present, determining that the ORF-A whose translation start codon is the candidate corresponding to said ribosome binding sequence is a true coding region, and writing the results of the determination into the memory;
(g) repeating the above steps until all of the ORFs stored in the memory are processed; and outputting, via an output device, the results of determination of transcription units and coding regions in the step (c), (e) or (f), which have been stored in the memory.
26. The program according to claim 25, wherein the above described step (e) is:
calling up the above described nucleotide sequence data;
calculating a “ribosome binding score” which expresses, as a numerical value, the paired state between a mRNA sequence of 4 to 17 bases upstream of a candidate for the translation start codon of the ORF-A and a sequence (3′-UUCCUCC-5′) involved in the binding to mRNA in a 16S rRNA 3′ terminal sequence, or between a mRNA sequence of 4 to 16 bases upstream of said candidate and a sequence (3′-UCCUCC-5′) involved in the binding to mRNA within a 16S rRNA 3′ terminal sequence, according to the four rules described below: (1) a pairing of G and C yields +4; (2) a pairing of A and U yields +2; (3) a pairing of G and U yields +1; (4) when no pairing is present at a base pair which is adjacent to a base pair where a pairing is present, then this yields −1;
maintaining a threshold value V₃ [herein V₃ is an integer from 4 to 12] for said ribosome binding score, determining that the above described mRNA sequence whose ribosome binding score exceeds the threshold value V₃ is a ribosome binding sequence, and selecting the translation start codon which corresponds to said ribosome binding sequence as the translation start codon of the ORF-A;
dividing the “region around the vicinity of the translation stop codon of the ORF-B” into two regions, “the region downstream of said vicinity” and “the region upstream of said vicinity”, if there is a plurality of said translation start codons for the ORF-A, and selecting the candidate whose priority is highest, wherein the order of priority is first “the region downstream of said vicinity” and second “the region upstream of said vicinity”;
selecting the candidate whose priority is highest if a plurality of translation start codons is present within the respective regions, wherein the distance from the translation stop codon of the ORF-B is used as an indicator of priority; and
determining that the ORF-A whose translation start codon is the selected candidate is a true coding region, determining that said ORF-A and ORF-B form a single transcription unit, and writing the results of the determination into the memory.
27. The program according to claim 25 or claim 26, wherein the above described step (f) is: calling up the above described nucleotide sequence data; calculating a “ribosome binding score”, which expresses as a numerical value the paired state between an mRNA sequence of 4 to 17 bases upstream of a candidate for the translation start codon of the ORF-A and a sequence (3′-UUCCUCC-5′) involved in the binding to mRNA in a 16S rRNA 3′ terminal sequence, or between an mRNA sequence of 4 to 16 bases upstream of said candidate and a sequence (3′-UCCUCC-5′) involved in the binding to mRNA in a 16S rRNA 3′ terminal sequence, according to the four rules described below: (1) a pairing of G and C yields +4; (2) a pairing of A and U yields +2; (3) a pairing of G and U yields +1; (4) when no pairing is present at a base pair which is adjacent to a base pair where a pairing is present, this yields −1; maintaining a threshold value V₁ for said ribosome binding score, determining that the above described mRNA sequence whose score exceeds the threshold value V₁ is the ribosome binding sequence, and selecting the candidate for the translation start codon which corresponds to said ribosome binding sequence as the translation start codon of the ORF-A; selecting the translation start codon corresponding to the ribosome binding sequence which yields the highest score as the translation start codon of the ORF-A if there is a plurality of said translation start codons; setting, in a stepwise manner, one or more threshold values which are smaller than V₁ and which include the threshold value V₃ if there is no candidate which exceeds the threshold value V₁, searching in a stepwise manner for the above described mRNA sequence whose score exceeds said threshold value, determining the ribosome binding sequence, and selecting the translation start codon which corresponds to said ribosome binding sequence as the translation start codon of the ORF-A; and determining that the ORF-A whose translation start codon is the selected candidate is a true coding region, and writing the results of the determination into the memory [herein V₁ is an integer which is greater than the V₃ of claim 2, and which is between 7 and 14].
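The four scoring rules of claim 27 can be illustrated with a short Python sketch. It is not the patented implementation: it assumes an ungapped alignment of a 7-base mRNA window (written 5′→3′) against 3′-UUCCUCC-5′, and it reads rule (4) as "each unpaired position with at least one paired neighbour costs 1". The names `alignment_score` and `best_upstream_score` are hypothetical.

```python
# Sketch of the "ribosome binding score" rules (1)-(4) of claim 27.
# Assumptions (not stated in the claims): ungapped alignment, and rule (4)
# applied once per unpaired position that has a paired neighbour.

ANTI_SD = "UUCCUCC"  # 16S rRNA 3' terminal sequence, written 3'->5'

# Keys are (mRNA base, rRNA base); values follow rules (1)-(3).
PAIR_SCORE = {
    ("G", "C"): 4, ("C", "G"): 4,
    ("A", "U"): 2, ("U", "A"): 2,
    ("G", "U"): 1, ("U", "G"): 1,
}


def alignment_score(mrna_5to3, anti_sd_3to5=ANTI_SD):
    """Score one ungapped alignment of two equal-length sequences."""
    mrna = mrna_5to3.upper().replace("T", "U")  # accept DNA input as well
    if len(mrna) != len(anti_sd_3to5):
        raise ValueError("sequences must have the same length")
    scores = [PAIR_SCORE.get(pair, 0) for pair in zip(mrna, anti_sd_3to5)]
    paired = [s > 0 for s in scores]
    total = sum(scores)
    for k, is_paired in enumerate(paired):
        if is_paired:
            continue
        left = k > 0 and paired[k - 1]
        right = k + 1 < len(paired) and paired[k + 1]
        if left or right:
            total -= 1  # rule (4)
    return total


def best_upstream_score(upstream_5to3, anti_sd_3to5=ANTI_SD):
    """Best score over every window of the upstream region."""
    w = len(anti_sd_3to5)
    if len(upstream_5to3) < w:
        raise ValueError("upstream region is shorter than the rRNA sequence")
    return max(
        alignment_score(upstream_5to3[i:i + w], anti_sd_3to5)
        for i in range(len(upstream_5to3) - w + 1)
    )
```

A caller would compare the result of `best_upstream_score` with the threshold V₁ (an integer between 7 and 14 according to the claim) and fall back to the lower thresholds stepwise if no candidate exceeds it.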
28. The program according to claim 26 or claim 27, characterized in that the above described “ribosome binding score” is calculated by deducting a numerical value P_(G) if the translation start codon is GTG, and by deducting a numerical value P_(T) if the translation start codon is TTG [herein, P_(G) is an integer from 1 to 4, and P_(T) is an integer from 2 to 6].
29. A program for executing the following steps on a computer: calling up the data for transcription units and coding regions stored in the memory after the above described step (g) in the program according to any one of claim 25 to claim 28; (h) deciding that the transcription unit Q or the coding region B is a “false transcription unit” or a “false coding region” if a transcription unit Q or a coding region B which is present upon the same strand as a transcription unit P or a coding region A is included in the transcription unit P or the coding region A; (i) deciding that the transcription unit Q or the coding region B is a “false transcription unit” or a “false coding region” if a transcription unit Q or a coding region B which is present upon the complementary strand to a transcription unit P or a coding region A is included in the transcription unit P or the coding region A; (j) deciding that the transcription unit or coding region whose length is shorter is a “false transcription unit” or a “false coding region” if a transcription unit P or a coding region A overlaps with a transcription unit Q or a coding region B which is present upon the complementary strand; and outputting the results of the above described decision via an output device.
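Steps (h) to (j) of claim 29 amount to interval tests on region coordinates, which the following Python sketch illustrates. The `Region` type, the 1-based inclusive coordinates, and the choice to flag nothing when two overlapping regions have equal length are assumptions made for this example; the claim does not specify them.

```python
# Sketch of the inclusion/overlap rules (h)-(j) of claim 29.
# Region, its fields, and the tie handling are illustrative assumptions.
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    name: str
    start: int   # 1-based, inclusive
    end: int     # 1-based, inclusive
    strand: str  # "+" or "-"

    @property
    def length(self):
        return self.end - self.start + 1


def contains(p, q):
    """True if q lies entirely inside p."""
    return p.start <= q.start and q.end <= p.end


def overlaps(p, q):
    """True if p and q share at least one position."""
    return p.start <= q.end and q.start <= p.end


def flag_false_regions(regions):
    """Return the names of regions judged false by rules (h), (i), (j)."""
    false_names = set()
    for p in regions:
        for q in regions:
            if p is q:
                continue
            if contains(p, q):
                # (h) same strand or (i) complementary strand: q is false
                false_names.add(q.name)
            elif p.strand != q.strand and overlaps(p, q) and not contains(q, p):
                # (j) partial overlap on opposite strands: shorter is false
                if p.length != q.length:
                    false_names.add(min(p, q, key=lambda r: r.length).name)
    return false_names
```

For example, a 301-base region on the minus strand that partly overlaps a 1000-base region on the plus strand would be reported as false under rule (j).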
30. A program for executing the following steps on a computer: (k) investigating the types of the codons and the number thereof which are utilized in a plurality (T) of the coding regions of the prokaryote, which regions are determined and inputted via an input means, selecting k types of combinations of codons among them wherein “the frequency of appearance of one codon is high, and the frequency of appearance of a codon which has the complementary sequence of the 3-base sequence of said codon is low”, and storing the codons in the memory; (l) measuring the frequency of appearance of the selected codons in a coding region A which is assumed to be the coding region from the data of said coding region A inputted via an input means, comparing the “number of times of the k types of codons whose frequency of appearance is high appearing in a coding region A which is assumed to be a coding region” with the “number of times of the k types of codons whose frequency of appearance is low appearing in said coding region A”, and deciding on the truth or falsity of said coding region A [herein k is an integer greater than or equal to 5 and less than or equal to 20]; and displaying the results of the above described decision via an output device.
31. The program according to claim 30, wherein the step (l) is comparing the “number of times of the k types of codons whose frequency of appearance is high appearing in a coding region A which is assumed to be the coding region” and the “number of times of the k types of codons whose frequency of appearance is low appearing in said coding region A” by using “the reciprocal of the sum of 1 and the ratio of the number of the latter to the number of the former” as a calculation formula, and deciding that said coding region A is a “false coding region” if the value of said reciprocal is less than a fixed value.
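As a reading aid for the calculation formula named in claim 31, write H for the number of times the k high-frequency codons appear in coding region A and L for the number of times the k low-frequency codons appear. H and L are shorthand introduced here and are not symbols used in the claims. Assuming H > 0, the reciprocal simplifies as follows:

$$\frac{1}{1 + L/H} = \frac{H}{H + L}$$

For instance, H = 30 and L = 10 give 30/40 = 0.75. The value lies between 0 and 1 and approaches 1 as the high-frequency codons dominate. The claim does not state how the case H = 0 is to be handled.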
32. The program for executing the following steps on a computer according to claim 30: (m) constructing a codon table by arranging the 64 types of codons so that the 3-base sequence of the i-th codon has the complementary sequence to the nucleotide sequence of the (i+32)-th codon, and storing the codon table in the memory; (n) inputting the nucleotide sequences of the number T of determined coding regions of a prokaryote, and obtaining y_(i) from the formula (2) below and y_(i+32) from the formula (3) below: $\begin{matrix}{y_{i} = {\left( {\sum\limits_{t = 1}^{T} C_{i}^{t}} - {\sum\limits_{t = 1}^{T} C_{i + 32}^{t}} \right) / {\sum\limits_{t = 1}^{T} {\sum\limits_{j = 1}^{64} C_{j}^{t}}}}} & (2) \\ {y_{i + 32} = {\left( {\sum\limits_{t = 1}^{T} C_{i + 32}^{t}} - {\sum\limits_{t = 1}^{T} C_{i}^{t}} \right) / {\sum\limits_{t = 1}^{T} {\sum\limits_{j = 1}^{64} C_{j}^{t}}}}} & (3)\end{matrix}$ wherein the number of times the i-th codon appears in the t-th coding region is expressed as $C_{i}^{t}$; (o) calling up the codon table which was obtained in the step (m) from the memory, setting up a correspondence between the y_(i) and y_(i+32) for the codons in the table, rearranging the sequence of the codons in the table in descending order of the y_(i) and the y_(i+32), selecting the top k codons for which the value of y_(i) or of y_(i+32) is large, and obtaining the value of Sd_(A) for a coding region A by the following formula (4): $\begin{matrix}{{Sd}_{A} = {2 \times {\sum\limits_{i = 1}^{k} C_{i}^{A}} / \left( {\sum\limits_{i = 1}^{k} C_{i}^{A}} + {\sum\limits_{i = {65 - k}}^{64} C_{i}^{A}} \right)}} & (4)\end{matrix}$ [herein the value of Sd_(A) is defined as 1 if $\left( {\sum\limits_{i = 1}^{k} C_{i}^{A}} + {\sum\limits_{i = {65 - k}}^{64} C_{i}^{A}} \right)$ is zero]; (p) deciding that said coding region A is a true coding region if the value of Sd_(A) of the coding region A obtained in the above described step is greater than or equal to a threshold value S₁, and deciding that it is a false coding region if said value of Sd_(A) is less than the threshold value S₁ [herein T is an integer greater than or equal to 2, i is a positive integer less than or equal to 32, j is a positive integer less than or equal to 64, t is a positive integer less than or equal to T, k is an integer from 5 to 20, and S₁ is a value from 0.8 to 1.8].
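Formulas (2) to (4) of claim 32 can be sketched in Python as follows. This is an illustration under stated assumptions, not the patented program: the partner of each codon is taken to be its reverse complement (as the abstract words it), ties in the ranking are broken alphabetically, and the function names `select_biased_codons` and `sd_score` are hypothetical.

```python
# Sketch of formulas (2)-(4): strand-bias codon selection and the Sd_A score.
# Assumption: the "complementary" partner codon is the reverse complement.
from collections import Counter
from itertools import product

_COMPLEMENT = str.maketrans("ACGT", "TGCA")
ALL_CODONS = sorted("".join(p) for p in product("ACGT", repeat=3))


def revcomp(codon):
    return codon.translate(_COMPLEMENT)[::-1]


def codon_counts(cds):
    """Count in-frame codons; a trailing partial codon is ignored."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    return Counter(cds[i:i + 3] for i in range(0, usable, 3))


def select_biased_codons(training_cds, k):
    """Top k codons by y = (count - partner count) / total, formulas (2)-(3)."""
    total = Counter()
    for cds in training_cds:
        total.update(codon_counts(cds))
    n = sum(total[c] for c in ALL_CODONS)
    if n == 0:
        raise ValueError("no codons in the training coding regions")
    y = {c: (total[c] - total[revcomp(c)]) / n for c in ALL_CODONS}
    # With many ties at y = 0 a codon and its partner could both be picked;
    # real genome-scale training data is assumed to avoid that.
    return sorted(ALL_CODONS, key=lambda c: (-y[c], c))[:k]


def sd_score(cds, top_codons):
    """Sd_A of formula (4); defined as 1 when the denominator is zero."""
    counts = codon_counts(cds)
    high = sum(counts[c] for c in top_codons)
    low = sum(counts[revcomp(c)] for c in top_codons)
    if high + low == 0:
        return 1.0
    return 2 * high / (high + low)
```

Under these assumptions a candidate read in the correct strand and frame scores near 2, while its reverse complement scores near 0, so a threshold S₁ between 0.8 and 1.8 separates the two.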
33. A program for executing the following steps on a computer: examining whether there is mutual overlapping and inclusion between coding regions of a prokaryote which are inputted via an input device; (q) calling up the above described nucleotide sequence data if a coding region or a coding region A which is assumed to be a coding region overlaps with a coding region B which is assumed to be a coding region and present upon the complementary strand, and if said coding region B is included in said coding region A, comparing the length L_(B) (in base pairs) of said coding region B with the length L_(A) (in base pairs) of said coding region A, and deciding that said coding region B is a “false coding region” if L_(B) is less than or equal to T_(P) % of L_(A); (r) deciding on the truth or falsity of said coding region A and of said coding region B by the steps of the program according to any one of claim 30 to claim 32 if L_(B) exceeds T_(P) % of L_(A) [herein T_(P) is a positive integer from 30 to 95].
34. The program according to claim 33, characterized by rewriting the data for the determined coding regions to a single coding region constructed by removing the translation stop codons from the coding regions which form a transcription unit and by linking up the resulting coding regions from said data, before executing the steps (k) and (l) described above.
35. A program for deciding on the truth or falsity of a coding region or of a transcription unit which is determined and stored in the memory in any one of claim 25 to claim 35, by the steps of the program according to any one of claim 30 to claim 34.
36. A program for executing the steps: calling up from the memory the data for coding regions which are determined to be true coding regions by the steps of the program according to any one of claim 25 to claim 35, calculating the length of the polypeptide encoded by each coding region, and deciding on the truth or falsity of the coding regions which encode the polypeptides of L_(M) amino acids or more in length, by using the program according to any one of claim 7 to claim 12, based upon the nucleotide sequences of the coding regions encoding the polypeptide of L_(F) amino acids or more in length [herein L_(F) is a positive integer greater than or equal to 100, and L_(M) is a positive integer greater than or equal to 20].
37. A program for executing the following steps on a computer: calculating the content of the first and third G residues and C residues of the codons in a coding region of a prokaryote whose GC content exceeds 50% by using a predetermined calculation formula from the data for said coding region inputted via an input device; deciding that said coding region is a “false coding region” if the calculated content is less than a fixed value; and outputting the results of the decision via an output device.
38. The program according to claim 37, wherein the following formula (5) is used as a calculation formula, the value of GC_(i) described below is used as a calculated content, and one value which is selected from 0.6 to 0.75 is used as a fixed value: $\begin{matrix}{{GC}_{i} = {\left( {y^{i{(1)}} + y^{i{(3)}}} \right) / {\sum\limits_{r = 1}^{3} y^{i{(r)}}}},\quad {\text{wherein}\quad y^{i{(r)}} = {\sum\limits_{n = 1}^{N_{i}} {\sum\limits_{b = 1}^{4} x_{n{(b)}}^{i{(r)}}}}}} & (5)\end{matrix}$ [herein, when the r-th base (r is 1, 2, or 3) of the n-th codon of the i-th coding region is b (b is 1, 2, 3, or 4), then $x_{n(b)}^{i(r)} = 1$ if b = 1 or 2, and $x_{n(b)}^{i(r)} = 0$ if b = 3 or 4; as for b, when the r-th base of the n-th codon of the i-th coding region is G, C, A, or T, then b is 1, 2, 3, or 4, respectively; i and n are positive integers; and N_(i) denotes the total number of the codons (excluding the translation stop codon) of the i-th coding region].
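Formula (5) of claim 38 counts G and C residues per codon position and reports the share that falls on the first and third positions. The Python sketch below illustrates that calculation; the function name `gc13_fraction`, the list of stop codons, and returning `None` when the region contains no G or C at all are assumptions of this example, since the claim does not cover that case.

```python
# Sketch of formula (5): share of G/C residues at codon positions 1 and 3.
# The handling of a zero denominator (None) is an assumption.

STOP_CODONS = {"TAA", "TAG", "TGA"}


def gc13_fraction(cds, drop_stop=True):
    """Return (y1 + y3) / (y1 + y2 + y3), where y_r counts G/C at position r."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    if drop_stop and codons and codons[-1] in STOP_CODONS:
        codons = codons[:-1]  # N_i excludes the translation stop codon
    y = [sum(1 for codon in codons if codon[r] in "GC") for r in range(3)]
    total = sum(y)
    if total == 0:
        return None
    return (y[0] + y[2]) / total
```

A coding region from a genome with GC content above 50% would then be judged false when the returned value is below the chosen fixed value, which the claim places between 0.6 and 0.75.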
39. A program for executing the following steps on a computer: calculating the content of the first and third G residues and C residues of the codons of the 5′ terminal region of a coding region of a prokaryote whose GC content exceeds 50% by using a predetermined calculation formula, from the data for said coding region inputted via an input device; deciding that the translation start codon of said coding region is a “false translation start codon” if the calculated content is less than a fixed value, and outputting the results of this decision via an output device; calling up the nucleotide sequence data of the above described coding region which is inputted via an input device, and re-searching for a translation start codon which is present downstream of said translation start codon decided to be false.
40. The program according to claim 39, wherein the following formula (5) is used as a calculation formula, the value of GC_(i) described below is used as a calculated content, and one value selected from 0.6 to 0.75 is used as a fixed value: $\begin{matrix}{{GC}_{i} = {\left( {y^{i{(1)}} + y^{i{(3)}}} \right) / {\sum\limits_{r = 1}^{3} y^{i{(r)}}}},\quad {\text{wherein}\quad y^{i{(r)}} = {\sum\limits_{n = 1}^{N_{i}} {\sum\limits_{b = 1}^{4} x_{n{(b)}}^{i{(r)}}}}}} & (5)\end{matrix}$ [herein, when the r-th base (r is 1, 2, or 3) of the n-th codon of the i-th coding region is b (b is 1, 2, 3, or 4), then $x_{n(b)}^{i(r)} = 1$ if b = 1 or 2, and $x_{n(b)}^{i(r)} = 0$ if b = 3 or 4; as for b, when the r-th base of the n-th codon of the i-th coding region is G, C, A, or T, then b is 1, 2, 3, or 4, respectively; i and n are positive integers; and N_(i) denotes the total number of the codons (excluding the translation stop codon) of the i-th coding region].
41. A program for executing the following steps on a computer: selecting m types of codons whose frequency of appearance is high and m types of codons whose frequency of appearance is low in the number T of the coding regions of a prokaryote which are determined by the steps of the program according to any one of claims 25 to 40, and storing the codons in the memory; measuring the “number of times of the m types of codons whose frequency of appearance in the number T of the coding regions is high appearing in the coding region A which is assumed to be the coding region” and the “number of times of the m types of codons whose frequency of appearance in the T coding regions is low appearing in said coding region A” from the inputted data for coding regions which are not yet determined and which are different from the number T of the coding regions; deciding on the truth or the falsity of said coding region A by comparing both numbers; and outputting the results of the decision via an output device [herein T is an integer greater than or equal to 2, and m is an integer greater than or equal to 5 and less than or equal to 20].
42. The program according to claim 41, wherein the method of comparing the “number of times of the m types of codons whose frequency of appearance is high appearing in the coding region A which is assumed to be the coding region” with the “number of times of the m types of codons whose frequency of appearance is low appearing in said coding region A” is the method which utilizes the “reciprocal of the sum of 1 and the ratio of the number of the latter to the number of the former” as a calculation formula, and which decides that said coding region A is a “false coding region” if the value of said reciprocal is less than a fixed value [herein m is an integer greater than or equal to 5 and less than or equal to 20].
43. The program for executing the following steps on a computer according to claim 41: (m) constructing a codon table in which the 64 types of codons are arranged so that the 3-base sequence of the i-th codon has a complementary sequence to the nucleotide sequence of the (i+32)-th codon, and storing the codon table in the memory; (s) obtaining y_(i) by the following formula (6): $\begin{matrix}{y_{i} = {{\sum\limits_{t = 1}^{T} C_{i}^{t}} / {\sum\limits_{t = 1}^{T} {\sum\limits_{j = 1}^{64} C_{j}^{t}}}}} & (6)\end{matrix}$ wherein the number of times of the i-th codon appearing in the t-th coding region is expressed as $C_{i}^{t}$; (t) calling up the codon table from the memory, rearranging the 64 types of codons in descending order of y_(i), selecting the “top m codons for which the value of y_(i) is large” and the “bottom m codons for which the value of y_(i) is small, excluding the translation stop codons”, and obtaining the value of Cd_(A) for the coding region A which is assumed to be the coding region from the following formula (7): $\begin{matrix}{{Cd}_{A} = {2 \times {\sum\limits_{i = 1}^{m} C_{i}^{A}} / \left( {\sum\limits_{i = 1}^{m} C_{i}^{A}} + {\sum\limits_{i = {62 - m}}^{61} C_{i}^{A}} \right)}} & (7)\end{matrix}$ [herein the value of Cd_(A) is defined as 1 if $\left( {\sum\limits_{i = 1}^{m} C_{i}^{A}} + {\sum\limits_{i = {62 - m}}^{61} C_{i}^{A}} \right)$ is zero]; and (u) deciding said coding region A to be a true coding region if the value of Cd_(A) for said coding region A which is calculated by the step (t) is greater than or equal to a threshold value CV, deciding said coding region A to be a false coding region if said value of Cd_(A) is less than the threshold value CV, and outputting the decision results via an output device [herein T is an integer greater than or equal to 2; i is a positive integer less than or equal to 64; j is a positive integer less than or equal to 64; t is a positive integer less than or equal to T; m is an integer from 5 to 20; and CV is a value from 0.8 to 1.8].
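Formulas (6) and (7) of claim 43 rank codons by overall usage and compare how often the most and least used codons occur in a candidate region. The Python sketch below illustrates this under stated assumptions: the ranking covers the 61 sense codons only (matching the summation limit of 61 in formula (7)), ties are broken alphabetically, and the names `frequent_and_rare` and `cd_score` are hypothetical.

```python
# Sketch of formulas (6)-(7): frequent/rare codon selection and Cd_A.
# Assumptions: 61 sense codons are ranked; ties are broken alphabetically.
from collections import Counter
from itertools import product

STOP_CODONS = {"TAA", "TAG", "TGA"}
SENSE_CODONS = sorted(
    codon
    for codon in ("".join(p) for p in product("ACGT", repeat=3))
    if codon not in STOP_CODONS
)


def codon_counts(cds):
    """Count in-frame codons; a trailing partial codon is ignored."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    return Counter(cds[i:i + 3] for i in range(0, usable, 3))


def frequent_and_rare(training_cds, m):
    """Return (top m, bottom m) sense codons ranked by pooled usage."""
    total = Counter()
    for cds in training_cds:
        total.update(codon_counts(cds))
    # Dividing by the grand total, as in formula (6), does not change the
    # ranking, so raw pooled counts are sorted here.
    ranked = sorted(SENSE_CODONS, key=lambda c: (-total[c], c))
    return ranked[:m], ranked[-m:]


def cd_score(cds, frequent, rare):
    """Cd_A of formula (7); defined as 1 when the denominator is zero."""
    counts = codon_counts(cds)
    high = sum(counts[c] for c in frequent)
    low = sum(counts[c] for c in rare)
    if high + low == 0:
        return 1.0
    return 2 * high / (high + low)
```

A candidate coding region would be accepted as true when `cd_score` is at least the threshold CV, which the claim places between 0.8 and 1.8.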
44. A program for executing the following steps on a computer, wherein a coding region of the prokaryote or a coding region A which is assumed to be a coding region overlaps with a coding region B which is assumed to be a coding region and present upon the complementary strand, and said coding region B is included in said coding region A: (v) comparing the length L_(B) (in base pairs) of the coding region B with the length L_(A) (in base pairs) of the coding region A, and deciding that the coding region B is a “false coding region” if L_(B) is less than or equal to T_(P) % of L_(A); (w) deciding on the truth or falsity of said coding region A and of said coding region B by the method of determining a genetic structure according to any one of claim 41 to claim 43 if L_(B) exceeds T_(P) % of L_(A) [herein, T_(P) is a positive integer from 30 to 95]; and outputting the results of the decision via an output device.
45. A program executing the following steps: removing translation stop codons from the coding regions which form a transcription unit from the data for determined coding regions; linking up the resulting coding regions into a single coding region; rewriting the data for determined coding regions to the resulting single coding region; and executing the steps of the program according to any one of claim 41 to claim 44.
46. A computer-readable recording medium on which the program according to any one of claim 25 to claim 45 is recorded.
47. A system for determining a genetic structure which comprises: (i) an input means for inputting nucleotide sequence data; (ii) a means for executing the program according to any one of claim 25 to claim 45, using the inputted data; and (iii) an output device for outputting the results which are obtained by (ii).